Shinkansen Travel Experience Hackathon¶
- July 19 at 6:00 AM - July 22 at 6:00 AM
- Allowed team size: 1-3
- My ranking: 11/35
Problem Statement:¶
The objective of the Shinkansen Travel Experience hackathon is to predict whether a passenger was satisfied with their overall experience of travelling on the Shinkansen bullet train. The Japanese high-speed passenger rail system is known for being rapid, reliable and consistent. Using machine-learning techniques, participants were required to ascertain how significantly each parameter contributes to the overall travel experience of passengers. The datasets consist of: a) the on-time performance of the trains along with passenger information, published in a file named ‘Traveldata_train.csv’, and b) surveys collected from a random sample of travellers from the same population as the travel data, recording the travellers' post-travel experiences, in the file named ‘Surveydata_train.csv’. The survey data contains feedback on the parameters of the travel experience, including the overall experience, which is the target variable. The files are separated into train and test sets.
Data Dictionary:¶
Travel Data:¶
- ID: The unique ID of the passenger
- Gender: The gender of the passenger
- Customer_Type: Loyalty type of the passenger
- Age: The age of the passenger
- Type_Travel: Purpose of travel of the passenger
- Travel_Class: The train class that the passenger traveled in
- Travel_Distance: The distance traveled by the passenger
- Departure_Delay_In_Mins: The delay (in minutes) in train departure
- Arrival_Delay_In_Mins: The delay (in minutes) in train arrival
Survey Data:¶
- Column Name: Column description
- ID: The unique ID of the passenger
- Platform_Location: How convenient the location of the platform is for the passenger
- Seat_Class: The type of the seat class on the train
- Overall_Experience: The overall experience of the passenger
- Seat_Comfort: The comfort level of the seat for the passenger
- Arrival_time_Convenient: How convenient the arrival time of the train is for the passenger
- Catering: The quality of the catering service for the passenger
- Onboard_Wi-Fi_Service: The quality of the onboard Wi-Fi service for the passenger
- Onboard_Entertainment: The quality of the onboard entertainment for the passenger
- Online_Support: The quality of the online support for the passenger
- Ease_of_Online_Booking: The level of ease of booking a trip online
- Onboard_Service: The quality of service onboard for the passenger
- Legroom: The convenience of the legroom provided for the passenger
- Baggage_Handling: The convenience of the handling of baggage for the customer
- CheckIn_Service: The convenience of the check-in service for the passenger
- Cleanliness: The passenger's view of the cleanliness of the service
- Online_Boarding: The convenience of the online boarding process for the passenger
Evaluation Criteria:¶
The evaluation metric is the accuracy score of the model, i.e. the percentage of predictions made by the model that turn out to be correct. The score is calculated as the total number of correct predictions (true positives plus true negatives) divided by the total number of observations. The highest possible accuracy is 100% (or 1) while the worst possible accuracy is 0%. Since the problem is a real-world machine-learning classification problem, the benchmark accuracy score is approximately 95.00%. My goal in this competition was therefore to achieve a higher score than the benchmark.
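The accuracy calculation described above can be sketched in a few lines (the confusion-matrix counts here are purely hypothetical, not results from this competition):

```python
# Accuracy = (TP + TN) / total observations, as defined above
tp, tn, fp, fn = 40, 50, 6, 4        # hypothetical counts
total = tp + tn + fp + fn
accuracy = (tp + tn) / total
print(f"Accuracy: {accuracy:.2%}")   # 90 correct out of 100 -> 90.00%
```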
Approach to arrive at the insights and recommendations:¶
Importing the necessary libraries
Reading in the dataset to get an overview
Conducting exploratory data analysis - a. Univariate, b. Bi & Multi-variate, c. Answering questions about particular variables of interest
Preparing the data
Defining the performance metric
Building the Machine Learning models, checking the performance and feature importances, tuning the models where necessary and running the predictions
Recording the observations
Comparing the model performances
Choosing the best model for deployment
Summarising the key observations, business insights and recommendations
1. Importing the necessary libraries¶
# Importing the library packages
import pandas as pd # library used for data manipulation and analysis
import numpy as np # library used for working with arrays
import matplotlib.pyplot as plt # library for plots and visualizations
import seaborn as sns # library for visualizations
sns.set()  # applying the default seaborn theme
# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')
# Importing machine learning models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Importing additional functions from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
# Importing functions to generate different metric scores
from sklearn.metrics import confusion_matrix,classification_report,roc_auc_score,precision_recall_curve,roc_curve,make_scorer,recall_score,accuracy_score
2. Reading in the dataset to get an overview¶
# Loading the data set
df_survey = pd.read_csv('Surveydata_train.csv')
df_survey_test = pd.read_csv('Surveydata_test.csv')
df_travel = pd.read_csv('Traveldata_train.csv')
df_travel_test = pd.read_csv('Traveldata_test.csv')
df_survey.head()
| ID | Overall_Experience | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Catering | Platform_Location | Onboard_Wifi_Service | Onboard_Entertainment | Online_Support | Ease_of_Online_Booking | Onboard_Service | Legroom | Baggage_Handling | CheckIn_Service | Cleanliness | Online_Boarding | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98800001 | 0 | Needs Improvement | Green Car | Excellent | Excellent | Very Convenient | Good | Needs Improvement | Acceptable | Needs Improvement | Needs Improvement | Acceptable | Needs Improvement | Good | Needs Improvement | Poor |
| 1 | 98800002 | 0 | Poor | Ordinary | Excellent | Poor | Needs Improvement | Good | Poor | Good | Good | Excellent | Needs Improvement | Poor | Needs Improvement | Good | Good |
| 2 | 98800003 | 1 | Needs Improvement | Green Car | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | Good | Excellent | Excellent | Excellent | Excellent | Excellent | Good | Excellent | Excellent |
| 3 | 98800004 | 0 | Acceptable | Ordinary | Needs Improvement | NaN | Needs Improvement | Acceptable | Needs Improvement | Acceptable | Acceptable | Acceptable | Acceptable | Acceptable | Good | Acceptable | Acceptable |
| 4 | 98800005 | 1 | Acceptable | Ordinary | Acceptable | Acceptable | Manageable | Needs Improvement | Good | Excellent | Good | Good | Good | Good | Good | Good | Good |
df_survey.shape
(94379, 17)
df_survey_test.head()
| ID | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Catering | Platform_Location | Onboard_Wifi_Service | Onboard_Entertainment | Online_Support | Ease_of_Online_Booking | Onboard_Service | Legroom | Baggage_Handling | CheckIn_Service | Cleanliness | Online_Boarding | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99900001 | Acceptable | Green Car | Acceptable | Acceptable | Manageable | Needs Improvement | Excellent | Good | Excellent | Excellent | Excellent | Excellent | Good | Excellent | Poor |
| 1 | 99900002 | Extremely Poor | Ordinary | Good | Poor | Manageable | Acceptable | Poor | Acceptable | Acceptable | Excellent | Acceptable | Good | Acceptable | Excellent | Acceptable |
| 2 | 99900003 | Excellent | Ordinary | Excellent | Excellent | Very Convenient | Excellent | Excellent | Excellent | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | Good | Needs Improvement | Excellent |
| 3 | 99900004 | Acceptable | Green Car | Excellent | Acceptable | Very Convenient | Poor | Acceptable | Excellent | Poor | Acceptable | Needs Improvement | Excellent | Excellent | Excellent | Poor |
| 4 | 99900005 | Excellent | Ordinary | Extremely Poor | Excellent | Needs Improvement | Excellent | Excellent | Excellent | Excellent | NaN | Acceptable | Excellent | Excellent | Excellent | Excellent |
df_survey_test.shape
(35602, 16)
df_travel.head()
| ID | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 98800001 | Female | Loyal Customer | 52.0 | NaN | Business | 272 | 0.0 | 5.0 |
| 1 | 98800002 | Male | Loyal Customer | 48.0 | Personal Travel | Eco | 2200 | 9.0 | 0.0 |
| 2 | 98800003 | Female | Loyal Customer | 43.0 | Business Travel | Business | 1061 | 77.0 | 119.0 |
| 3 | 98800004 | Female | Loyal Customer | 44.0 | Business Travel | Business | 780 | 13.0 | 18.0 |
| 4 | 98800005 | Female | Loyal Customer | 50.0 | Business Travel | Business | 1981 | 0.0 | 0.0 |
df_travel.shape
(94379, 9)
df_travel_test.head()
| ID | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 99900001 | Female | NaN | 36.0 | Business Travel | Business | 532 | 0.0 | 0.0 |
| 1 | 99900002 | Female | Disloyal Customer | 21.0 | Business Travel | Business | 1425 | 9.0 | 28.0 |
| 2 | 99900003 | Male | Loyal Customer | 60.0 | Business Travel | Business | 2832 | 0.0 | 0.0 |
| 3 | 99900004 | Female | Loyal Customer | 29.0 | Personal Travel | Eco | 1352 | 0.0 | 0.0 |
| 4 | 99900005 | Male | Disloyal Customer | 18.0 | Business Travel | Business | 1610 | 17.0 | 0.0 |
df_travel_test.shape
(35602, 9)
I will merge the 'survey' and 'travel data' datasets and then investigate which columns are relevant to our task.
# Creating the train dataset by merging the survey data with the travel data
df_train = pd.merge(df_survey, df_travel.drop_duplicates(['ID']), on="ID", how="left")
df_train.head()
| ID | Overall_Experience | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Catering | Platform_Location | Onboard_Wifi_Service | Onboard_Entertainment | Online_Support | ... | Cleanliness | Online_Boarding | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 98800001 | 0 | Needs Improvement | Green Car | Excellent | Excellent | Very Convenient | Good | Needs Improvement | Acceptable | ... | Needs Improvement | Poor | Female | Loyal Customer | 52.0 | NaN | Business | 272 | 0.0 | 5.0 |
| 1 | 98800002 | 0 | Poor | Ordinary | Excellent | Poor | Needs Improvement | Good | Poor | Good | ... | Good | Good | Male | Loyal Customer | 48.0 | Personal Travel | Eco | 2200 | 9.0 | 0.0 |
| 2 | 98800003 | 1 | Needs Improvement | Green Car | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | Good | Excellent | ... | Excellent | Excellent | Female | Loyal Customer | 43.0 | Business Travel | Business | 1061 | 77.0 | 119.0 |
| 3 | 98800004 | 0 | Acceptable | Ordinary | Needs Improvement | NaN | Needs Improvement | Acceptable | Needs Improvement | Acceptable | ... | Acceptable | Acceptable | Female | Loyal Customer | 44.0 | Business Travel | Business | 780 | 13.0 | 18.0 |
| 4 | 98800005 | 1 | Acceptable | Ordinary | Acceptable | Acceptable | Manageable | Needs Improvement | Good | Excellent | ... | Good | Good | Female | Loyal Customer | 50.0 | Business Travel | Business | 1981 | 0.0 | 0.0 |
5 rows × 25 columns
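As a quick sanity check of the merge pattern used above, here is the same logic on a toy pair of frames (the IDs and values are purely illustrative): a left join on ID, after dropping duplicate IDs from the right-hand frame, keeps every survey row and fills the travel fields with NaN where no match exists.

```python
import pandas as pd

survey = pd.DataFrame({"ID": [1, 2, 3], "Overall_Experience": [1, 0, 1]})
travel = pd.DataFrame({"ID": [1, 1, 2], "Age": [30, 30, 45]})  # ID 1 duplicated

# drop_duplicates ensures the left join cannot multiply survey rows
merged = pd.merge(survey, travel.drop_duplicates(["ID"]), on="ID", how="left")
# Every survey row is kept; ID 3 has no travel record, so its Age is NaN
```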
df_train.shape
(94379, 25)
# Checking the data types of the columns in the dataset
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 94379 entries, 0 to 94378 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 94379 non-null int64 1 Overall_Experience 94379 non-null int64 2 Seat_Comfort 94318 non-null object 3 Seat_Class 94379 non-null object 4 Arrival_Time_Convenient 85449 non-null object 5 Catering 85638 non-null object 6 Platform_Location 94349 non-null object 7 Onboard_Wifi_Service 94349 non-null object 8 Onboard_Entertainment 94361 non-null object 9 Online_Support 94288 non-null object 10 Ease_of_Online_Booking 94306 non-null object 11 Onboard_Service 86778 non-null object 12 Legroom 94289 non-null object 13 Baggage_Handling 94237 non-null object 14 CheckIn_Service 94302 non-null object 15 Cleanliness 94373 non-null object 16 Online_Boarding 94373 non-null object 17 Gender 94302 non-null object 18 Customer_Type 85428 non-null object 19 Age 94346 non-null float64 20 Type_Travel 85153 non-null object 21 Travel_Class 94379 non-null object 22 Travel_Distance 94379 non-null int64 23 Departure_Delay_in_Mins 94322 non-null float64 24 Arrival_Delay_in_Mins 94022 non-null float64 dtypes: float64(3), int64(3), object(19) memory usage: 18.0+ MB
Observations:
The column data types indicate that most of the travel experience and survey parameters are of object type. As we saw in the data overview, the categorical nature of these observations means further pre-processing will be required to convert the data types into a form our ML models can use. We will look at the numerical and categorical variables in more detail in the EDA section.
# Checking the missing values in each column
df_train.isna().sum()
ID 0 Overall_Experience 0 Seat_Comfort 61 Seat_Class 0 Arrival_Time_Convenient 8930 Catering 8741 Platform_Location 30 Onboard_Wifi_Service 30 Onboard_Entertainment 18 Online_Support 91 Ease_of_Online_Booking 73 Onboard_Service 7601 Legroom 90 Baggage_Handling 142 CheckIn_Service 77 Cleanliness 6 Online_Boarding 6 Gender 77 Customer_Type 8951 Age 33 Type_Travel 9226 Travel_Class 0 Travel_Distance 0 Departure_Delay_in_Mins 57 Arrival_Delay_in_Mins 357 dtype: int64
# Checking the missing values in the data percentage-wise
round(df_train.isnull().sum() / df_train.isnull().count() * 100, 2)
ID 0.00 Overall_Experience 0.00 Seat_Comfort 0.06 Seat_Class 0.00 Arrival_Time_Convenient 9.46 Catering 9.26 Platform_Location 0.03 Onboard_Wifi_Service 0.03 Onboard_Entertainment 0.02 Online_Support 0.10 Ease_of_Online_Booking 0.08 Onboard_Service 8.05 Legroom 0.10 Baggage_Handling 0.15 CheckIn_Service 0.08 Cleanliness 0.01 Online_Boarding 0.01 Gender 0.08 Customer_Type 9.48 Age 0.03 Type_Travel 9.78 Travel_Class 0.00 Travel_Distance 0.00 Departure_Delay_in_Mins 0.06 Arrival_Delay_in_Mins 0.38 dtype: float64
Observations:
The columns with missing values worth noting are:
- Arrival_Time_Convenient 9.46%
- Catering 9.26%
- Onboard_Service 8.05%
- Customer_Type 9.48%
- Type_Travel 9.78%
In the EDA, we will see what the potential impact of the missing values could be on our overall analysis and select an appropriate method to deal with them.
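One simple option for the low-percentage categorical gaps, sketched here on toy data (not necessarily the method chosen later in the notebook), is mode imputation, i.e. filling each missing value with the column's most frequent category:

```python
import pandas as pd

df = pd.DataFrame({"Catering": ["Good", "Good", None, "Poor"]})
# Fill missing entries with the most frequent value ("Good" here)
df["Catering"] = df["Catering"].fillna(df["Catering"].mode()[0])
```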
# Checking the number of unique values in each column
df_train.nunique()
ID 94379 Overall_Experience 2 Seat_Comfort 6 Seat_Class 2 Arrival_Time_Convenient 6 Catering 6 Platform_Location 6 Onboard_Wifi_Service 6 Onboard_Entertainment 6 Online_Support 6 Ease_of_Online_Booking 6 Onboard_Service 6 Legroom 6 Baggage_Handling 5 CheckIn_Service 6 Cleanliness 6 Online_Boarding 6 Gender 2 Customer_Type 2 Age 75 Type_Travel 2 Travel_Class 2 Travel_Distance 5210 Departure_Delay_in_Mins 437 Arrival_Delay_in_Mins 434 dtype: int64
Observations from the overview and sanity checks:
The merged dataset has 94379 rows and 25 columns.
There are missing values in the dataset.
Of the 25 columns, 19 are of object type and contain the categorical variables. The continuous variables are of type float (3) and int (3).
The ID column contains the unique identifier of each passenger. I will drop this column as it will not be useful for the purpose of the analysis.
The Age column contains 75 unique values. Since the aim of the project is to predict the overall experience for each passenger based on criteria including their specific age, I will train the models using the age variable as is. In other data science contexts, e.g. clustering, it would be appropriate to group the ages into bins, i.e. age categories, to help with model interpretation.
Similarly, I will not group the other continuous variables into bins, since the models I will be building work well with numeric values.
So far, the columns that will require transformation are those of 'object' data type.
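For reference, binning ages as mentioned above could look like the following (the bin edges and labels are purely illustrative, not ones used in this project):

```python
import pandas as pd

ages = pd.Series([7, 25, 40, 67, 85])
# Hypothetical age categories for illustration only
age_bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                  labels=["Child", "Young", "Middle", "Senior"])
```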
# Making a list of all categorical variables in the train dataset
cat_cols=['Overall_Experience','Seat_Comfort','Seat_Class', 'Arrival_Time_Convenient','Catering','Platform_Location',
'Onboard_Wifi_Service','Onboard_Entertainment','Online_Support','Ease_of_Online_Booking','Onboard_Service',
'Legroom','Baggage_Handling','CheckIn_Service','Cleanliness','Online_Boarding','Gender','Customer_Type',
'Type_Travel','Travel_Class']
# Converting the data type of each categorical variable to 'category'
for column in cat_cols:
df_train[column]=df_train[column].astype('category')
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 94379 entries, 0 to 94378 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 94379 non-null int64 1 Overall_Experience 94379 non-null category 2 Seat_Comfort 94318 non-null category 3 Seat_Class 94379 non-null category 4 Arrival_Time_Convenient 85449 non-null category 5 Catering 85638 non-null category 6 Platform_Location 94349 non-null category 7 Onboard_Wifi_Service 94349 non-null category 8 Onboard_Entertainment 94361 non-null category 9 Online_Support 94288 non-null category 10 Ease_of_Online_Booking 94306 non-null category 11 Onboard_Service 86778 non-null category 12 Legroom 94289 non-null category 13 Baggage_Handling 94237 non-null category 14 CheckIn_Service 94302 non-null category 15 Cleanliness 94373 non-null category 16 Online_Boarding 94373 non-null category 17 Gender 94302 non-null category 18 Customer_Type 85428 non-null category 19 Age 94346 non-null float64 20 Type_Travel 85153 non-null category 21 Travel_Class 94379 non-null category 22 Travel_Distance 94379 non-null int64 23 Departure_Delay_in_Mins 94322 non-null float64 24 Arrival_Delay_in_Mins 94022 non-null float64 dtypes: category(20), float64(3), int64(2) memory usage: 5.4 MB
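The drop in memory usage reported by info() (from 18.0+ MB down to 5.4 MB) comes from the category dtype storing small integer codes plus a single copy of each label, instead of one Python string object per row. A minimal sketch:

```python
import pandas as pd

s_obj = pd.Series(["Good"] * 100_000, dtype=object)
s_cat = s_obj.astype("category")
# The categorical version stores int8 codes + one "Good" string,
# so its deep memory footprint is far smaller than the object version's
print(s_cat.memory_usage(deep=True) < s_obj.memory_usage(deep=True))
```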
Creating a list of the numerical variables
# Creating the array of numerical columns excluding the ID variable
num_cols=['Age','Travel_Distance','Departure_Delay_in_Mins','Arrival_Delay_in_Mins']
Test data¶
# Creating the test dataset by merging the test survey data with the test travel data
df_test = pd.merge(df_survey_test, df_travel_test.drop_duplicates(['ID']), on="ID", how="left")
df_test.head()
| ID | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Catering | Platform_Location | Onboard_Wifi_Service | Onboard_Entertainment | Online_Support | Ease_of_Online_Booking | ... | Cleanliness | Online_Boarding | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99900001 | Acceptable | Green Car | Acceptable | Acceptable | Manageable | Needs Improvement | Excellent | Good | Excellent | ... | Excellent | Poor | Female | NaN | 36.0 | Business Travel | Business | 532 | 0.0 | 0.0 |
| 1 | 99900002 | Extremely Poor | Ordinary | Good | Poor | Manageable | Acceptable | Poor | Acceptable | Acceptable | ... | Excellent | Acceptable | Female | Disloyal Customer | 21.0 | Business Travel | Business | 1425 | 9.0 | 28.0 |
| 2 | 99900003 | Excellent | Ordinary | Excellent | Excellent | Very Convenient | Excellent | Excellent | Excellent | Needs Improvement | ... | Needs Improvement | Excellent | Male | Loyal Customer | 60.0 | Business Travel | Business | 2832 | 0.0 | 0.0 |
| 3 | 99900004 | Acceptable | Green Car | Excellent | Acceptable | Very Convenient | Poor | Acceptable | Excellent | Poor | ... | Excellent | Poor | Female | Loyal Customer | 29.0 | Personal Travel | Eco | 1352 | 0.0 | 0.0 |
| 4 | 99900005 | Excellent | Ordinary | Extremely Poor | Excellent | Needs Improvement | Excellent | Excellent | Excellent | Excellent | ... | Excellent | Excellent | Male | Disloyal Customer | 18.0 | Business Travel | Business | 1610 | 17.0 | 0.0 |
5 rows × 24 columns
df_test.shape
(35602, 24)
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 35602 entries, 0 to 35601 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 35602 non-null int64 1 Seat_Comfort 35580 non-null object 2 Seat_Class 35602 non-null object 3 Arrival_Time_Convenient 32277 non-null object 4 Catering 32245 non-null object 5 Platform_Location 35590 non-null object 6 Onboard_Wifi_Service 35590 non-null object 7 Onboard_Entertainment 35594 non-null object 8 Online_Support 35576 non-null object 9 Ease_of_Online_Booking 35584 non-null object 10 Onboard_Service 32730 non-null object 11 Legroom 35577 non-null object 12 Baggage_Handling 35562 non-null object 13 CheckIn_Service 35580 non-null object 14 Cleanliness 35600 non-null object 15 Online_Boarding 35600 non-null object 16 Gender 35572 non-null object 17 Customer_Type 32219 non-null object 18 Age 35591 non-null float64 19 Type_Travel 32154 non-null object 20 Travel_Class 35602 non-null object 21 Travel_Distance 35602 non-null int64 22 Departure_Delay_in_Mins 35573 non-null float64 23 Arrival_Delay_in_Mins 35479 non-null float64 dtypes: float64(3), int64(2), object(19) memory usage: 6.5+ MB
df_test.isna().sum()
ID 0 Seat_Comfort 22 Seat_Class 0 Arrival_Time_Convenient 3325 Catering 3357 Platform_Location 12 Onboard_Wifi_Service 12 Onboard_Entertainment 8 Online_Support 26 Ease_of_Online_Booking 18 Onboard_Service 2872 Legroom 25 Baggage_Handling 40 CheckIn_Service 22 Cleanliness 2 Online_Boarding 2 Gender 30 Customer_Type 3383 Age 11 Type_Travel 3448 Travel_Class 0 Travel_Distance 0 Departure_Delay_in_Mins 29 Arrival_Delay_in_Mins 123 dtype: int64
# Making a list of all categorical variables in the test dataset
cat_col_test=['Seat_Comfort','Seat_Class', 'Arrival_Time_Convenient','Catering','Platform_Location',
'Onboard_Wifi_Service','Onboard_Entertainment','Online_Support','Ease_of_Online_Booking','Onboard_Service',
'Legroom','Baggage_Handling','CheckIn_Service','Cleanliness','Online_Boarding','Gender','Customer_Type',
'Type_Travel','Travel_Class']
# Converting the data type of each categorical variable to 'category'
for column in cat_col_test:
df_test[column]=df_test[column].astype('category')
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 35602 entries, 0 to 35601 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 35602 non-null int64 1 Seat_Comfort 35580 non-null category 2 Seat_Class 35602 non-null category 3 Arrival_Time_Convenient 32277 non-null category 4 Catering 32245 non-null category 5 Platform_Location 35590 non-null category 6 Onboard_Wifi_Service 35590 non-null category 7 Onboard_Entertainment 35594 non-null category 8 Online_Support 35576 non-null category 9 Ease_of_Online_Booking 35584 non-null category 10 Onboard_Service 32730 non-null category 11 Legroom 35577 non-null category 12 Baggage_Handling 35562 non-null category 13 CheckIn_Service 35580 non-null category 14 Cleanliness 35600 non-null category 15 Online_Boarding 35600 non-null category 16 Gender 35572 non-null category 17 Customer_Type 32219 non-null category 18 Age 35591 non-null float64 19 Type_Travel 32154 non-null category 20 Travel_Class 35602 non-null category 21 Travel_Distance 35602 non-null int64 22 Departure_Delay_in_Mins 35573 non-null float64 23 Arrival_Delay_in_Mins 35479 non-null float64 dtypes: category(19), float64(3), int64(2) memory usage: 2.0 MB
# Creating copies of the train and test datasets as backups
data_train = df_train.copy()
data_test = df_test.copy()
3. Conducting exploratory data analysis - a. Univariate, b. Bi / Multi-variate, c. Answering questions about particular variables of interest¶
Approach to EDA:
- Viewing the statistical summaries of the dataset
- Using the lists of continuous and categorical variables grouped above
- Univariate analysis
- Bi / Multi-variate analysis
- Observations and providing answers to key business questions
# Checking the summary statistics of the columns with continuous observations
df_train.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 94379.0 | 9.884719e+07 | 27245.014865 | 98800001.0 | 98823595.5 | 98847190.0 | 98870784.5 | 98894379.0 |
| Age | 94346.0 | 3.941965e+01 | 15.116632 | 7.0 | 27.0 | 40.0 | 51.0 | 85.0 |
| Travel_Distance | 94379.0 | 1.978888e+03 | 1027.961019 | 50.0 | 1359.0 | 1923.0 | 2538.0 | 6951.0 |
| Departure_Delay_in_Mins | 94322.0 | 1.464709e+01 | 38.138781 | 0.0 | 0.0 | 0.0 | 12.0 | 1592.0 |
| Arrival_Delay_in_Mins | 94022.0 | 1.500522e+01 | 38.439409 | 0.0 | 0.0 | 0.0 | 13.0 | 1584.0 |
Observations:¶
The statistical summary shows:
Age has a median (50th percentile) of 40 years; the minimum and maximum are 7 and 85, respectively.
Travel distance ranges from a minimum of 50 to a maximum of 6,951.
75% of departures were delayed by at most 12 minutes.
75% of arrivals were delayed by at most 13 minutes.
# Checking the summary of categorical variables
df_train.describe(exclude = 'number').T
| count | unique | top | freq | |
|---|---|---|---|---|
| Overall_Experience | 94379 | 2 | 1 | 51593 |
| Seat_Comfort | 94318 | 6 | Acceptable | 21158 |
| Seat_Class | 94379 | 2 | Green Car | 47435 |
| Arrival_Time_Convenient | 85449 | 6 | Good | 19574 |
| Catering | 85638 | 6 | Acceptable | 18468 |
| Platform_Location | 94349 | 6 | Manageable | 24173 |
| Onboard_Wifi_Service | 94349 | 6 | Good | 22835 |
| Onboard_Entertainment | 94361 | 6 | Good | 30446 |
| Online_Support | 94288 | 6 | Good | 30016 |
| Ease_of_Online_Booking | 94306 | 6 | Good | 28909 |
| Onboard_Service | 86778 | 6 | Good | 27265 |
| Legroom | 94289 | 6 | Good | 28870 |
| Baggage_Handling | 94237 | 5 | Good | 34944 |
| CheckIn_Service | 94302 | 6 | Good | 26502 |
| Cleanliness | 94373 | 6 | Good | 35427 |
| Online_Boarding | 94373 | 6 | Good | 25533 |
| Gender | 94302 | 2 | Female | 47815 |
| Customer_Type | 85428 | 2 | Loyal Customer | 69823 |
| Type_Travel | 85153 | 2 | Business Travel | 58617 |
| Travel_Class | 94379 | 2 | Eco | 49342 |
Univariate analysis
Continuous data
# Creating the histograms
df_train[num_cols].hist(figsize=(10,8))
plt.show()
Observations:
The Age variable tends towards a normal distribution, with the majority of passengers falling between 20 and 60 years
Travel distance is right-skewed, indicating that the majority of passengers travel distances shorter than the mean
Delays in departures and arrivals are similarly positively skewed
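The skew visible in the histograms can also be confirmed numerically with pandas' skew(); a toy illustration on a right-tailed, delay-like series (values here are made up):

```python
import pandas as pd

delays = pd.Series([0, 0, 0, 5, 12, 30, 120, 400])
# A positive skewness coefficient confirms the long right tail
print(delays.skew() > 0)
```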
# Defining the hist_box() function that plots a boxplot and histogram in one visual.
def hist_box(df, col):
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)})
# Adding a graph in each part
sns.boxplot(data=df, x=col, ax=ax_box, showmeans=True)
sns.histplot(data=df, x=col, kde=True, ax=ax_hist)
plt.show()
1. Age¶
hist_box(df_train,'Age')
2. Travel_Distance¶
hist_box(df_train,'Travel_Distance')
# Checking outliers in the Travel Distance column
df_train[df_train['Travel_Distance']>4300]
| ID | Overall_Experience | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Catering | Platform_Location | Onboard_Wifi_Service | Onboard_Entertainment | Online_Support | ... | Cleanliness | Online_Boarding | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 98800052 | 0 | Poor | Ordinary | Extremely Poor | Extremely Poor | Manageable | Poor | Extremely Poor | Poor | ... | Good | Poor | Female | Loyal Customer | 26.0 | Business Travel | Business | 4560 | 0.0 | 7.0 |
| 79 | 98800080 | 0 | Needs Improvement | Green Car | Acceptable | Poor | Manageable | Needs Improvement | Needs Improvement | Needs Improvement | ... | Needs Improvement | Needs Improvement | Male | Loyal Customer | 25.0 | NaN | Business | 5406 | 0.0 | 0.0 |
| 112 | 98800113 | 1 | Good | Ordinary | Good | Good | Convenient | Good | Good | Good | ... | Excellent | Good | Male | Loyal Customer | 26.0 | Business Travel | Business | 4615 | 17.0 | 6.0 |
| 115 | 98800116 | 0 | Acceptable | Green Car | Poor | Poor | Inconvenient | Acceptable | Acceptable | Acceptable | ... | Good | Acceptable | Male | Loyal Customer | 22.0 | Business Travel | Business | 4733 | 0.0 | 2.0 |
| 133 | 98800134 | 0 | Needs Improvement | Green Car | Good | Good | Convenient | Needs Improvement | Needs Improvement | Needs Improvement | ... | Acceptable | Needs Improvement | Female | Loyal Customer | 24.0 | NaN | Business | 5135 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94081 | 98894082 | 1 | Good | Ordinary | Good | Good | Convenient | Excellent | Good | Excellent | ... | Good | Excellent | Male | Loyal Customer | 22.0 | Business Travel | Business | 4439 | 0.0 | 0.0 |
| 94149 | 98894150 | 0 | Needs Improvement | Ordinary | Excellent | Needs Improvement | Convenient | Needs Improvement | Excellent | Needs Improvement | ... | Acceptable | Needs Improvement | Female | Loyal Customer | 25.0 | Personal Travel | Eco | 6655 | 0.0 | 0.0 |
| 94170 | 98894171 | 0 | Poor | Green Car | Good | Good | Convenient | Poor | Needs Improvement | Poor | ... | Acceptable | Poor | Male | Loyal Customer | 31.0 | Business Travel | Business | 4617 | 68.0 | 56.0 |
| 94295 | 98894296 | 0 | Needs Improvement | Green Car | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | ... | Good | Needs Improvement | Male | Loyal Customer | 30.0 | Business Travel | Business | 4927 | 26.0 | 25.0 |
| 94353 | 98894354 | 1 | Acceptable | Green Car | Acceptable | Acceptable | Manageable | Excellent | Excellent | Good | ... | Excellent | Excellent | Female | Loyal Customer | 28.0 | Business Travel | Business | 4645 | 0.0 | 2.0 |
1938 rows × 25 columns
hist_box(df_train,'Departure_Delay_in_Mins')
# Checking outliers in the Departure Delay column
df_train[df_train['Departure_Delay_in_Mins']>20]
| ID | Overall_Experience | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Catering | Platform_Location | Onboard_Wifi_Service | Onboard_Entertainment | Online_Support | ... | Cleanliness | Online_Boarding | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 98800003 | 1 | Needs Improvement | Green Car | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | Good | Excellent | ... | Excellent | Excellent | Female | Loyal Customer | 43.0 | Business Travel | Business | 1061 | 77.0 | 119.0 |
| 14 | 98800015 | 0 | Acceptable | Ordinary | Poor | Poor | Inconvenient | Acceptable | Acceptable | Acceptable | ... | Needs Improvement | Acceptable | Male | Loyal Customer | 33.0 | Business Travel | Business | 1180 | 49.0 | 49.0 |
| 19 | 98800020 | 1 | Excellent | Green Car | Good | Good | Manageable | Good | Good | Good | ... | Excellent | Good | Male | Disloyal Customer | 24.0 | Business Travel | Eco | 1994 | 22.0 | 85.0 |
| 30 | 98800031 | 0 | Acceptable | Green Car | Acceptable | Acceptable | Manageable | Good | Acceptable | Excellent | ... | Good | Good | Male | Loyal Customer | 9.0 | NaN | Eco | 2379 | 100.0 | 93.0 |
| 33 | 98800034 | 1 | Excellent | Ordinary | NaN | Excellent | Needs Improvement | Poor | Excellent | Poor | ... | Good | Poor | Male | Disloyal Customer | 22.0 | Business Travel | Business | 2515 | 42.0 | 30.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94354 | 98894355 | 1 | Needs Improvement | Green Car | Good | Needs Improvement | Needs Improvement | Needs Improvement | Good | Good | ... | Excellent | Good | Male | Loyal Customer | 48.0 | Business Travel | Business | 2179 | 65.0 | 54.0 |
| 94359 | 98894360 | 1 | Acceptable | Green Car | Acceptable | Acceptable | Manageable | Good | Excellent | Good | ... | Good | Excellent | Female | Loyal Customer | 39.0 | Business Travel | Business | 2418 | 24.0 | 22.0 |
| 94367 | 98894368 | 0 | Acceptable | Ordinary | Good | Acceptable | Inconvenient | Acceptable | Needs Improvement | Needs Improvement | ... | Needs Improvement | Needs Improvement | Male | Loyal Customer | 14.0 | Personal Travel | Business | 2842 | 142.0 | 141.0 |
| 94374 | 98894375 | 0 | Poor | Ordinary | Good | Good | Convenient | Poor | Poor | Poor | ... | Good | Poor | Male | Loyal Customer | 32.0 | Business Travel | Business | 1357 | 83.0 | 125.0 |
| 94378 | 98894379 | 0 | Acceptable | Ordinary | Poor | Acceptable | Manageable | Acceptable | Acceptable | Acceptable | ... | Good | Acceptable | Male | Loyal Customer | 54.0 | NaN | Eco | 2107 | 28.0 | 28.0 |
17655 rows × 25 columns
hist_box(df_train,'Arrival_Delay_in_Mins')
# Checking outliers in the Arrival Delay column
df_train[df_train['Arrival_Delay_in_Mins']>20]
| ID | Overall_Experience | Seat_Comfort | Seat_Class | Arrival_Time_Convenient | Catering | Platform_Location | Onboard_Wifi_Service | Onboard_Entertainment | Online_Support | ... | Cleanliness | Online_Boarding | Gender | Customer_Type | Age | Type_Travel | Travel_Class | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 98800003 | 1 | Needs Improvement | Green Car | Needs Improvement | Needs Improvement | Needs Improvement | Needs Improvement | Good | Excellent | ... | Excellent | Excellent | Female | Loyal Customer | 43.0 | Business Travel | Business | 1061 | 77.0 | 119.0 |
| 13 | 98800014 | 0 | Good | Ordinary | Good | Good | Manageable | Good | Excellent | NaN | ... | Acceptable | Good | Female | Loyal Customer | 47.0 | Personal Travel | Eco | 1100 | 20.0 | 34.0 |
| 14 | 98800015 | 0 | Acceptable | Ordinary | Poor | Poor | Inconvenient | Acceptable | Acceptable | Acceptable | ... | Needs Improvement | Acceptable | Male | Loyal Customer | 33.0 | Business Travel | Business | 1180 | 49.0 | 49.0 |
| 17 | 98800018 | 1 | Excellent | Green Car | Excellent | Excellent | Needs Improvement | Excellent | Excellent | Excellent | ... | Excellent | Excellent | Male | Loyal Customer | 68.0 | Personal Travel | Eco | 3756 | 20.0 | 52.0 |
| 19 | 98800020 | 1 | Excellent | Green Car | Good | Good | Manageable | Good | Good | Good | ... | Excellent | Good | Male | Disloyal Customer | 24.0 | Business Travel | Eco | 1994 | 22.0 | 85.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94359 | 98894360 | 1 | Acceptable | Green Car | Acceptable | Acceptable | Manageable | Good | Excellent | Good | ... | Good | Excellent | Female | Loyal Customer | 39.0 | Business Travel | Business | 2418 | 24.0 | 22.0 |
| 94367 | 98894368 | 0 | Acceptable | Ordinary | Good | Acceptable | Inconvenient | Acceptable | Needs Improvement | Needs Improvement | ... | Needs Improvement | Needs Improvement | Male | Loyal Customer | 14.0 | Personal Travel | Business | 2842 | 142.0 | 141.0 |
| 94371 | 98894372 | 0 | Poor | Ordinary | Poor | Poor | Inconvenient | Good | Good | Acceptable | ... | Poor | Poor | Female | Loyal Customer | 58.0 | Business Travel | Business | 502 | 0.0 | 30.0 |
| 94374 | 98894375 | 0 | Poor | Ordinary | Good | Good | Convenient | Poor | Poor | Poor | ... | Good | Poor | Male | Loyal Customer | 32.0 | Business Travel | Business | 1357 | 83.0 | 125.0 |
| 94378 | 98894379 | 0 | Acceptable | Ordinary | Poor | Acceptable | Manageable | Acceptable | Acceptable | Acceptable | ... | Good | Acceptable | Male | Loyal Customer | 54.0 | NaN | Eco | 2107 | 28.0 | 28.0 |
18007 rows × 25 columns
Observations:
- The mean and median age of the passengers is around 40 years.
- The mean and median travel distance are just below 2000, with the mean higher than the median. There is a significant proportion of outliers in travel distance, the longest being around 7000.
- The mean and median delays in departures and arrivals are broadly the same. However, both variables have significant outliers that would need further investigation to understand their contribution to the overall customer experience.
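As an aside, the standard 1.5×IQR rule that the boxplots visualise can be sketched on a small hypothetical delay series (the values below are illustrative, not taken from the data):

```python
# Flagging outliers with the 1.5*IQR rule on a toy delay series
import pandas as pd

delays = pd.Series([0, 0, 2, 5, 9, 13, 17, 22, 49, 77, 142])

q1, q3 = delays.quantile(0.25), delays.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr          # upper whisker of the boxplot

outliers = delays[delays > upper]
print(upper, outliers.tolist())  # 83.5 [142]
```

The same rule, applied to `Departure_Delay_in_Mins` or `Arrival_Delay_in_Mins`, would flag the long tail of delays seen in the tables above.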
Categorical data
# Setting the figure size for plots generated with seaborn
import seaborn as sns
sns.set(rc={"figure.figsize": (10,4)}) # width = 10, height = 4
sns.set_palette("tab10")
# Plotting a count plot for each categorical variable
cat_plot_cols = ['Overall_Experience', 'Seat_Comfort', 'Seat_Class', 'Arrival_Time_Convenient',
                 'Catering', 'Platform_Location', 'Onboard_Wifi_Service', 'Onboard_Entertainment',
                 'Online_Support', 'Ease_of_Online_Booking', 'Onboard_Service', 'Legroom',
                 'Baggage_Handling', 'CheckIn_Service', 'Cleanliness', 'Online_Boarding',
                 'Gender', 'Customer_Type', 'Type_Travel', 'Travel_Class']
for col in cat_plot_cols:
    sns.countplot(x=df_train[col])
    plt.show()
Observations:
- Overall_Experience: The target variable has two classes with a moderately unequal distribution; satisfied customers are in the majority.
- Seat_Comfort: Ratings of 'poor', 'acceptable', 'needs improvement' and 'good' each numbered around 15,000 or more. Ratings of 'excellent' were approximately 13,000 and 'extremely poor' around 3,000, indicating a wide distribution in the variable.
- Seat_Class: Seat classes were equally distributed.
- Arrival_Time_Convenient: There is a wide distribution of ratings for arrival time convenience. Notably, the count of 'poor' ratings is close to those for 'acceptable' and 'needs improvement'.
- Catering: There is a wide distribution of ratings for catering, the majority of them neutral to positive, i.e., 'acceptable' and above.
- Platform_Location: Platform location shows a roughly normal distribution of ratings, with very few observations of 'very inconvenient'.
- Onboard_Wifi_Service: The onboard wifi service rating is fairly evenly distributed, except for about 10% of customers who rated it 'poor'.
- Onboard_Entertainment: There was a wide and unequal distribution of ratings for onboard entertainment. Altogether, at least 50% of customers rated it 'good' or 'excellent'.
- Online_Support: There was more positive sentiment for online support, although around a third of the ratings fell in the 'acceptable' or lower categories.
- Ease_of_Online_Booking: The distribution of ratings for ease of online booking was similar to that of online support.
- Onboard_Service: The distribution of ratings for onboard service was similar to that of ease of online booking.
- Legroom: The distribution of ratings for legroom was similar to that of onboard service.
- Baggage_Handling: The distribution of ratings for baggage handling was skewed towards the positive.
- CheckIn_Service: Check-in service ratings were mostly positive, but ratings of 'acceptable' and lower made up just under half of the observations.
- Cleanliness: The distribution of ratings for cleanliness was wide but mostly positive.
- Online_Boarding: The distribution of ratings for online boarding was wide and mostly positive.
- Gender: Gender was nearly equally distributed.
- Customer_Type: Customer type was quite unequally distributed, with 'loyal customers' in the majority.
- Type_Travel: Around two-thirds of customers travelled for business rather than personal reasons.
- Travel_Class: Travel class was nearly equally distributed between business and eco.
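Since the survey ratings are ordinal, one optional refinement (not used in this notebook) is to cast them to an ordered categorical, so that plots and summaries respect the rating scale; the level names mirror the survey scale and the series is a toy example:

```python
# Casting rating strings to an ordered categorical (toy data)
import pandas as pd

levels = ["Extremely Poor", "Poor", "Needs Improvement",
          "Acceptable", "Good", "Excellent"]

s = pd.Series(["Good", "Poor", "Excellent", "Good"])
s_ord = s.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Integer codes follow the scale order, 0 = 'Extremely Poor' ... 5 = 'Excellent'
print(s_ord.cat.codes.tolist())  # [4, 1, 5, 4]
```

With an ordered dtype, seaborn's `countplot` would display categories in scale order rather than order of appearance.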
Bivariate and Multivariate analysis¶
# Finding and visualising the correlation between the numerical variables using a heatmap
# Plotting the correlation between numerical variables
plt.figure(figsize=(15,8))
sns.heatmap(df_train[num_cols].corr(),annot=True, fmt='0.2f', cmap='YlGnBu');
Checking the relationship between customer satisfaction, i.e., overall experience, and the numerical variables
# Mean of numerical variables grouped by status
df_train.groupby(['Overall_Experience'])[num_cols].mean()
| Age | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | |
|---|---|---|---|---|
| Overall_Experience | ||||
| 0 | 37.49018 | 2025.826088 | 17.738600 | 18.392374 |
| 1 | 41.01968 | 1939.962650 | 12.083107 | 12.196763 |
Observations:¶
- Except for the delays in arrivals and departures, the heatmap does not show any significant correlation between the predictor variables. As a late departure naturally causes a late arrival, this correlation is to be expected.
- The average age of customers was 37 in class 0 and 41 in class 1.
- There was a difference of roughly 86 between the average distances traveled by class 0 and class 1 customers.
- The average delays were around five to six minutes shorter for class 1 than for class 0.
# Let us plot the categorical variables.
for i in cat_cols:
    if i != 'Overall_Experience':
        (pd.crosstab(df_train[i], df_train['Overall_Experience'], normalize='index')*100).plot(kind='bar', figsize=(10,4), stacked=True)
        plt.ylabel('Overall Experience %')
Observations:¶
- Seat_Comfort: For class 0, i.e., the dissatisfied customers, overall experience was less related to seat comfort than for satisfied customers. Interestingly, the proportions of 'excellent' and 'extremely poor' ratings were equal among satisfied customers.
- Seat_Class: For both the green car and ordinary seat classes, about 40% of customers were in class 0 and 60% in class 1.
- Arrival_Time_Convenient: Between class 0 and class 1, the ratings for arrival time convenience were split just below half across the categories. The parameter showed a positive influence on overall experience.
- Catering: For class 0, catering was more negatively associated with overall experience than for class 1.
- Platform_Location: For class 0 customers, platform location ratings varied, with none being strongly related to overall experience. For class 1, on the other hand, a high count of 'very inconvenient' was observed in spite of the customers being satisfied overall.
- Onboard_Wifi_Service: The onboard wifi service rating was more positively associated with class 1 than with class 0.
- Onboard_Entertainment: Class 0 had high counts of negative and neutral sentiment for onboard entertainment in relation to overall experience. Class 1 had mixed ratings, with high counts of 'excellent' and 'good' on one hand and 'extremely poor' on the other.
- Online_Support: For class 0, overall experience was strongly influenced by online support in the negative direction, whereas for class 1 the influence was more positive.
- Ease_of_Online_Booking: Ease of online booking was more influential on dissatisfaction, i.e., class 0 customers, with class 1 customers having a more positive overall experience given the parameter.
- Onboard_Service: For class 0, onboard service was more associated with negative sentiment in terms of overall experience than for class 1.
- Legroom: For class 0, ratings for legroom were mostly neutral to negative in terms of overall experience. For class 1, the sentiments varied widely, e.g., similar counts of 'excellent' and 'extremely poor' were observed.
- Baggage_Handling: The overall experience for class 0 was neutral to negative in terms of baggage handling. For class 1, the variable had a positive influence on overall experience.
- CheckIn_Service: The influence of check-in service on overall experience was largely negative for class 0 and largely positive for class 1.
- Cleanliness: Similarly to check-in service, the overall experience of class 0 was more negatively influenced by cleanliness than that of class 1.
- Online_Boarding: As with cleanliness, the distribution of ratings was wide and mostly positive.
- Gender: Class 0 consisted of more males than females, i.e., females had a more positive overall experience.
- Customer_Type: Class 0 comprised more disloyal customers than class 1, i.e., loyal customers had a more positive overall experience.
- Type_Travel: There was a positive relationship between business travel and overall experience compared to personal travel.
- Travel_Class: As with type of travel, the 'business' travel class was more positively linked with overall experience than the 'eco' class.
Model Building Approach:¶
- Data preparation
- Partition the data into a train and test set
- Build a model on the train data
- Tune the model if required
- Test the data on the test set
Data Preparation¶
Defining the predictor variables (X) and the target variable (Y)¶
The datasets provided were already split into train and test sets using a stratified 70/30 split. Stratified sampling ensures that the relative class frequencies are approximately preserved in both partitions.
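The pre-split files make this step unnecessary here, but a stratified 70/30 split of the kind described can be sketched with scikit-learn's `train_test_split` on toy data:

```python
# Stratified 70/30 split on a toy target with 60% positives
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0] * 10)   # 100 labels, 60% class 1
X = pd.DataFrame({"x": range(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Class frequencies are preserved in both partitions
print(y_tr.mean(), y_te.mean())  # 0.6 0.6
```

Without `stratify=y`, the class balance in each partition would be left to chance.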
# Separating the dependent variable and other variables on the train set
X_train=df_train.drop(columns='Overall_Experience')
Y_train=df_train['Overall_Experience']
# Separating the dependent variable and other variables on the test set
X_test=df_test
# Y_test=['Overall_Experience']
Missing Values¶
Earlier we identified that our data has missing values. We will impute missing values using median for continuous variables and mode for categorical variables using SimpleImputer.
The SimpleImputer provides basic strategies for imputing missing values. Missing values can be imputed with a provided constant value, or using the statistics (mean, median, or most frequent) of each column in which the missing values are located.
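A minimal, self-contained illustration of the two strategies used below, median for numeric columns and most frequent for categorical columns, on toy data:

```python
# SimpleImputer with the two strategies used in this notebook (toy data)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 40.0]})
cat = pd.DataFrame({"Catering": ["Good", "Good", np.nan, "Poor"]})

num_imputed = SimpleImputer(strategy="median").fit_transform(num)
cat_imputed = SimpleImputer(strategy="most_frequent").fit_transform(cat)

print(num_imputed.ravel())  # NaN replaced by the median, 30.0
print(cat_imputed.ravel())  # NaN replaced by the mode, 'Good'
```

Note that the imputer is fit on the training data only and then applied to the test data, exactly as done with `si1` and `si2` below.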
X_train.isna().sum()
ID 0 Seat_Comfort 61 Seat_Class 0 Arrival_Time_Convenient 8930 Catering 8741 Platform_Location 30 Onboard_Wifi_Service 30 Onboard_Entertainment 18 Online_Support 91 Ease_of_Online_Booking 73 Onboard_Service 7601 Legroom 90 Baggage_Handling 142 CheckIn_Service 77 Cleanliness 6 Online_Boarding 6 Gender 77 Customer_Type 8951 Age 33 Type_Travel 9226 Travel_Class 0 Travel_Distance 0 Departure_Delay_in_Mins 57 Arrival_Delay_in_Mins 357 dtype: int64
Imputing missing data¶
from sklearn.impute import SimpleImputer  # may already be imported in the setup cell

si1=SimpleImputer(strategy='median')
median_imputed_col=['Age','Departure_Delay_in_Mins','Arrival_Delay_in_Mins']
# Fit and transform the train data
X_train[median_imputed_col]=si1.fit_transform(X_train[median_imputed_col])
#Transform the test data i.e. replace missing values with the median calculated using training data
X_test[median_imputed_col]=si1.transform(X_test[median_imputed_col])
si2=SimpleImputer(strategy='most_frequent')
mode_imputed_col=['Seat_Comfort','Arrival_Time_Convenient','Catering','Platform_Location',
'Onboard_Wifi_Service','Onboard_Entertainment','Online_Support','Ease_of_Online_Booking',
'Onboard_Service','Legroom','Baggage_Handling','CheckIn_Service','Cleanliness',
'Online_Boarding','Gender','Customer_Type','Type_Travel']
# Fit and transform the train data
X_train[mode_imputed_col]=si2.fit_transform(X_train[mode_imputed_col])
# Transform the test data i.e. replace missing values with the mode calculated using training data
X_test[mode_imputed_col]=si2.transform(X_test[mode_imputed_col])
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
ID 0 Seat_Comfort 0 Seat_Class 0 Arrival_Time_Convenient 0 Catering 0 Platform_Location 0 Onboard_Wifi_Service 0 Onboard_Entertainment 0 Online_Support 0 Ease_of_Online_Booking 0 Onboard_Service 0 Legroom 0 Baggage_Handling 0 CheckIn_Service 0 Cleanliness 0 Online_Boarding 0 Gender 0 Customer_Type 0 Age 0 Type_Travel 0 Travel_Class 0 Travel_Distance 0 Departure_Delay_in_Mins 0 Arrival_Delay_in_Mins 0 dtype: int64 ------------------------------ ID 0 Seat_Comfort 0 Seat_Class 0 Arrival_Time_Convenient 0 Catering 0 Platform_Location 0 Onboard_Wifi_Service 0 Onboard_Entertainment 0 Online_Support 0 Ease_of_Online_Booking 0 Onboard_Service 0 Legroom 0 Baggage_Handling 0 CheckIn_Service 0 Cleanliness 0 Online_Boarding 0 Gender 0 Customer_Type 0 Age 0 Type_Travel 0 Travel_Class 0 Travel_Distance 0 Departure_Delay_in_Mins 0 Arrival_Delay_in_Mins 0 dtype: int64
Observations:
- After imputing the missing data, there are no longer any missing values.
- One-hot encoding: since several categorical variables contain string values, we will create dummy variables before modelling.
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_test.shape)
(94379, 79) (35602, 74)
Observations:
- After encoding there are 79 columns in the train data set and 74 columns in the test data set.
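The mismatch is resolved below by dropping the rare train-only columns; an alternative sketch (not what this notebook does) is to reindex the test frame onto the train columns, filling dummy columns for unseen categories with 0. The frames here are hypothetical:

```python
# Aligning test dummy columns onto the train columns (toy frames)
import pandas as pd

train = pd.DataFrame({"A_x": [1, 0], "A_y": [0, 1], "A_rare": [0, 0]})
test = pd.DataFrame({"A_x": [1], "A_y": [0]})   # category 'rare' never appears

test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(list(test_aligned.columns))  # matches the train columns
```

This guarantees identical column order and count in both sets, which most estimators require.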
Next we will compare the encoded feature names in the train and test datasets to see where they differ.
# Checking the index numbers of the columns and their names in the train dataframe and their order
for idx, column_name in enumerate(X_train.columns):
print(f"Column {idx}: {column_name}")
Column 0: ID Column 1: Age Column 2: Travel_Distance Column 3: Departure_Delay_in_Mins Column 4: Arrival_Delay_in_Mins Column 5: Seat_Comfort_Excellent Column 6: Seat_Comfort_Extremely Poor Column 7: Seat_Comfort_Good Column 8: Seat_Comfort_Needs Improvement Column 9: Seat_Comfort_Poor Column 10: Seat_Class_Ordinary Column 11: Arrival_Time_Convenient_Excellent Column 12: Arrival_Time_Convenient_Extremely Poor Column 13: Arrival_Time_Convenient_Good Column 14: Arrival_Time_Convenient_Needs Improvement Column 15: Arrival_Time_Convenient_Poor Column 16: Catering_Excellent Column 17: Catering_Extremely Poor Column 18: Catering_Good Column 19: Catering_Needs Improvement Column 20: Catering_Poor Column 21: Platform_Location_Inconvenient Column 22: Platform_Location_Manageable Column 23: Platform_Location_Needs Improvement Column 24: Platform_Location_Very Convenient Column 25: Platform_Location_Very Inconvenient Column 26: Onboard_Wifi_Service_Excellent Column 27: Onboard_Wifi_Service_Extremely Poor Column 28: Onboard_Wifi_Service_Good Column 29: Onboard_Wifi_Service_Needs Improvement Column 30: Onboard_Wifi_Service_Poor Column 31: Onboard_Entertainment_Excellent Column 32: Onboard_Entertainment_Extremely Poor Column 33: Onboard_Entertainment_Good Column 34: Onboard_Entertainment_Needs Improvement Column 35: Onboard_Entertainment_Poor Column 36: Online_Support_Excellent Column 37: Online_Support_Extremely Poor Column 38: Online_Support_Good Column 39: Online_Support_Needs Improvement Column 40: Online_Support_Poor Column 41: Ease_of_Online_Booking_Excellent Column 42: Ease_of_Online_Booking_Extremely Poor Column 43: Ease_of_Online_Booking_Good Column 44: Ease_of_Online_Booking_Needs Improvement Column 45: Ease_of_Online_Booking_Poor Column 46: Onboard_Service_Excellent Column 47: Onboard_Service_Extremely Poor Column 48: Onboard_Service_Good Column 49: Onboard_Service_Needs Improvement Column 50: Onboard_Service_Poor Column 51: Legroom_Excellent Column 52: 
Legroom_Extremely Poor Column 53: Legroom_Good Column 54: Legroom_Needs Improvement Column 55: Legroom_Poor Column 56: Baggage_Handling_Excellent Column 57: Baggage_Handling_Good Column 58: Baggage_Handling_Needs Improvement Column 59: Baggage_Handling_Poor Column 60: CheckIn_Service_Excellent Column 61: CheckIn_Service_Extremely Poor Column 62: CheckIn_Service_Good Column 63: CheckIn_Service_Needs Improvement Column 64: CheckIn_Service_Poor Column 65: Cleanliness_Excellent Column 66: Cleanliness_Extremely Poor Column 67: Cleanliness_Good Column 68: Cleanliness_Needs Improvement Column 69: Cleanliness_Poor Column 70: Online_Boarding_Excellent Column 71: Online_Boarding_Extremely Poor Column 72: Online_Boarding_Good Column 73: Online_Boarding_Needs Improvement Column 74: Online_Boarding_Poor Column 75: Gender_Male Column 76: Customer_Type_Loyal Customer Column 77: Type_Travel_Personal Travel Column 78: Travel_Class_Eco
# Checking the index numbers of the columns and their names in the train dataframe and their order
for idx, column_name in enumerate(X_test.columns):
print(f"Column {idx}: {column_name}")
Column 0: ID Column 1: Age Column 2: Travel_Distance Column 3: Departure_Delay_in_Mins Column 4: Arrival_Delay_in_Mins Column 5: Seat_Comfort_Excellent Column 6: Seat_Comfort_Extremely Poor Column 7: Seat_Comfort_Good Column 8: Seat_Comfort_Needs Improvement Column 9: Seat_Comfort_Poor Column 10: Seat_Class_Ordinary Column 11: Arrival_Time_Convenient_Excellent Column 12: Arrival_Time_Convenient_Extremely Poor Column 13: Arrival_Time_Convenient_Good Column 14: Arrival_Time_Convenient_Needs Improvement Column 15: Arrival_Time_Convenient_Poor Column 16: Catering_Excellent Column 17: Catering_Extremely Poor Column 18: Catering_Good Column 19: Catering_Needs Improvement Column 20: Catering_Poor Column 21: Platform_Location_Inconvenient Column 22: Platform_Location_Manageable Column 23: Platform_Location_Needs Improvement Column 24: Platform_Location_Very Convenient Column 25: Onboard_Wifi_Service_Excellent Column 26: Onboard_Wifi_Service_Extremely Poor Column 27: Onboard_Wifi_Service_Good Column 28: Onboard_Wifi_Service_Needs Improvement Column 29: Onboard_Wifi_Service_Poor Column 30: Onboard_Entertainment_Excellent Column 31: Onboard_Entertainment_Extremely Poor Column 32: Onboard_Entertainment_Good Column 33: Onboard_Entertainment_Needs Improvement Column 34: Onboard_Entertainment_Poor Column 35: Online_Support_Excellent Column 36: Online_Support_Good Column 37: Online_Support_Needs Improvement Column 38: Online_Support_Poor Column 39: Ease_of_Online_Booking_Excellent Column 40: Ease_of_Online_Booking_Extremely Poor Column 41: Ease_of_Online_Booking_Good Column 42: Ease_of_Online_Booking_Needs Improvement Column 43: Ease_of_Online_Booking_Poor Column 44: Onboard_Service_Excellent Column 45: Onboard_Service_Good Column 46: Onboard_Service_Needs Improvement Column 47: Onboard_Service_Poor Column 48: Legroom_Excellent Column 49: Legroom_Extremely Poor Column 50: Legroom_Good Column 51: Legroom_Needs Improvement Column 52: Legroom_Poor Column 53: 
Baggage_Handling_Excellent Column 54: Baggage_Handling_Good Column 55: Baggage_Handling_Needs Improvement Column 56: Baggage_Handling_Poor Column 57: CheckIn_Service_Excellent Column 58: CheckIn_Service_Good Column 59: CheckIn_Service_Needs Improvement Column 60: CheckIn_Service_Poor Column 61: Cleanliness_Excellent Column 62: Cleanliness_Good Column 63: Cleanliness_Needs Improvement Column 64: Cleanliness_Poor Column 65: Online_Boarding_Excellent Column 66: Online_Boarding_Extremely Poor Column 67: Online_Boarding_Good Column 68: Online_Boarding_Needs Improvement Column 69: Online_Boarding_Poor Column 70: Gender_Male Column 71: Customer_Type_Loyal Customer Column 72: Type_Travel_Personal Travel Column 73: Travel_Class_Eco
# Cross-checking the columns
common_columns = X_train.columns.isin(X_test.columns).sum()
print(f"Number of common columns: {common_columns}")
Number of common columns: 74
There are 74 common columns in the train and test sets. Next we will check for the discrepancies in feature names.
Feature names in the train set and missing in the test set:¶
print(X_train['CheckIn_Service_Extremely Poor'].value_counts())
print(X_train['Cleanliness_Extremely Poor'].value_counts())
print(X_train['Onboard_Service_Extremely Poor'].value_counts())
print(X_train['Online_Support_Extremely Poor'].value_counts())
print(X_train['Platform_Location_Very Inconvenient'].value_counts())
Observations:
- In the train dataset, there are very few observations of the following ratings: 'CheckIn_Service_Extremely Poor', 'Cleanliness_Extremely Poor', 'Onboard_Service_Extremely Poor', 'Online_Support_Extremely Poor', 'Platform_Location_Very Inconvenient'. Therefore, we will drop these columns and thus have matching features in the train and test sets.
# Defining the features to be dropped
columns_to_drop = ['ID',
'Platform_Location_Very Inconvenient',
'Online_Support_Extremely Poor',
'Onboard_Service_Extremely Poor',
'CheckIn_Service_Extremely Poor',
'Cleanliness_Extremely Poor']
# Dropping the features in the train set
X_train.drop(columns=columns_to_drop, inplace=True)
X_train.head()
| Age | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | Seat_Comfort_Excellent | Seat_Comfort_Extremely Poor | Seat_Comfort_Good | Seat_Comfort_Needs Improvement | Seat_Comfort_Poor | Seat_Class_Ordinary | ... | Cleanliness_Poor | Online_Boarding_Excellent | Online_Boarding_Extremely Poor | Online_Boarding_Good | Online_Boarding_Needs Improvement | Online_Boarding_Poor | Gender_Male | Customer_Type_Loyal Customer | Type_Travel_Personal Travel | Travel_Class_Eco | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.0 | 272 | 0.0 | 5.0 | False | False | False | True | False | False | ... | False | False | False | False | False | True | False | True | False | False |
| 1 | 48.0 | 2200 | 9.0 | 0.0 | False | False | False | False | True | True | ... | False | False | False | True | False | False | True | True | True | True |
| 2 | 43.0 | 1061 | 77.0 | 119.0 | False | False | False | True | False | False | ... | False | True | False | False | False | False | False | True | False | False |
| 3 | 44.0 | 780 | 13.0 | 18.0 | False | False | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 4 | 50.0 | 1981 | 0.0 | 0.0 | False | False | False | False | False | True | ... | False | False | False | True | False | False | False | True | False | False |
5 rows × 73 columns
# Dropping the features in the test set
X_test.drop(columns='ID', inplace=True)
X_test.head()
| Age | Travel_Distance | Departure_Delay_in_Mins | Arrival_Delay_in_Mins | Seat_Comfort_Excellent | Seat_Comfort_Extremely Poor | Seat_Comfort_Good | Seat_Comfort_Needs Improvement | Seat_Comfort_Poor | Seat_Class_Ordinary | ... | Cleanliness_Poor | Online_Boarding_Excellent | Online_Boarding_Extremely Poor | Online_Boarding_Good | Online_Boarding_Needs Improvement | Online_Boarding_Poor | Gender_Male | Customer_Type_Loyal Customer | Type_Travel_Personal Travel | Travel_Class_Eco | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.0 | 532 | 0.0 | 0.0 | False | False | False | False | False | False | ... | False | False | False | False | False | True | False | True | False | False |
| 1 | 21.0 | 1425 | 9.0 | 28.0 | False | True | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 60.0 | 2832 | 0.0 | 0.0 | True | False | False | False | False | True | ... | False | True | False | False | False | False | True | True | False | False |
| 3 | 29.0 | 1352 | 0.0 | 0.0 | False | False | False | False | False | False | ... | False | False | False | False | False | True | False | True | True | True |
| 4 | 18.0 | 1610 | 17.0 | 0.0 | True | False | False | False | False | True | ... | False | True | False | False | False | False | True | False | False | False |
5 rows × 73 columns
Model evaluation criterion¶
The model can make two types of wrong predictions:
- Predicting a customer will not be satisfied, i.e., rate overall experience 0, when the customer actually rates it 1.
- Predicting a customer will rate overall experience 1 when the customer actually rates it 0, i.e., they were not satisfied.
Which case is more important?
- The goal of our classification problem is to predict which customers will rate overall experience 0 or 1, since customer dissatisfaction represents a failure to deliver a good overall experience. Through modelling, the features related to satisfaction and dissatisfaction can be uncovered and used as evidence to explore options for improving the overall customer experience.
- In other words, we seek the highest prediction accuracy, so that the model can be applied on an ongoing basis to new customer survey and travel data.
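As a quick illustration of the accuracy criterion and the confusion matrix it summarises, on hypothetical labels:

```python
# Accuracy = share of correct predictions; confusion matrix rows are actuals
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

print(accuracy_score(y_true, y_pred))    # 6 of 8 correct = 0.75
print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
```

With the default label ordering, row 0 / column 0 of the matrix corresponds to class 0 (not satisfied) and row 1 / column 1 to class 1 (satisfied).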
# Creating the metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    # Label order follows the classes: row/column 0 = class 0 (not satisfied)
    sns.heatmap(cm, annot=True, fmt='d',
                xticklabels=['Not Satisfied', 'Satisfied'],
                yticklabels=['Not Satisfied', 'Satisfied'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
We will be building the following models:
- Logistic Regression
- Support Vector Machine
- Decision Tree
- Random Forest
- AdaBoost
- XGBoost
The best performing model will be recommended for deployment together with the list of important features.
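A hedged sketch of how several of the listed models could be compared with cross-validated accuracy on synthetic data (SVM and XGBoost omitted for brevity; all parameters here are illustrative, not the tuned values used later):

```python
# Comparing candidate classifiers with 5-fold cross-validated accuracy
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```

The same loop applied to `X_train`/`Y_train` would give a first ranking of the candidates before any tuning.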
1) Logistic Regression¶
# Fitting logistic regression model
lg = LogisticRegression()
lg.fit(X_train,Y_train)
LogisticRegression()
# Checking the performance on the training data
Y_pred_train = lg.predict(X_train)
metrics_score(Y_train, Y_pred_train)
precision recall f1-score support
0 0.85 0.84 0.85 42786
1 0.87 0.88 0.87 51593
accuracy 0.86 94379
macro avg 0.86 0.86 0.86 94379
weighted avg 0.86 0.86 0.86 94379
Observations:
The logistic regression model achieved an accuracy of 86% on the train dataset. The result is well below the benchmark of 95%.
In classification, the class of interest is considered the positive class. In this problem, the class of interest is 1 i.e., the customers who are satisfied and are likely to rate overall experience 1.
Reading the confusion matrix (left to right, top to bottom):
- True Negative (Actual=0, Predicted=0): Model predicts that a customer rated overall experience 0 and the customer actually rated it 0.
- False Positive (Actual=0, Predicted=1): Model predicts that a customer rated overall experience 1 and the customer actually rated it 0.
- False Negative (Actual=1, Predicted=0): Model predicts that a customer rated overall experience 0 and the customer actually rated it 1.
- True Positive (Actual=1, Predicted=1): Model predicts that a customer rated overall experience 1 and the customer actually rated it 1.
The model identifies most of the customers who would be satisfied (recall of 0.88 for class 1), but it still misclassifies a sizeable share of passengers in both classes.
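The four cells above can be read off programmatically; a small sketch with toy labels (not the actual predictions) showing how scikit-learn orders the matrix:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration: sklearn sorts classes, so row 0 / col 0
# correspond to class 0 (not satisfied) and row 1 / col 1 to class 1.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```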
# Predicting on the test dataset
Y_pred_test = lg.predict(X_test)
# metrics_score(Y_test, Y_pred_test)
The test dataset has no target labels, so we cannot evaluate the model's performance on it; we can only generate predictions for submission.
Y_pred_test
array([1, 0, 1, ..., 0, 1, 0])
Let's check the coefficients and find which variables are leading to satisfaction:
# Printing the coefficients of logistic regression
cols=X_train.columns
coef_lg=lg.coef_
pd.DataFrame(coef_lg,columns=cols).T.sort_values(by=0,ascending=False)
| 0 | |
|---|---|
| Onboard_Entertainment_Excellent | 2.073871 |
| Seat_Comfort_Excellent | 1.505244 |
| Customer_Type_Loyal Customer | 1.128552 |
| Onboard_Entertainment_Good | 0.965846 |
| Seat_Comfort_Extremely Poor | 0.770817 |
| ... | ... |
| Seat_Comfort_Needs Improvement | -0.628659 |
| Onboard_Entertainment_Poor | -0.649767 |
| Onboard_Entertainment_Needs Improvement | -0.910343 |
| Gender_Male | -1.290781 |
| Travel_Class_Eco | -1.630595 |
73 rows × 1 columns
Observations:¶
According to the logistic regression model, the features with the largest positive effect on overall experience are:
- Onboard_Entertainment_Excellent is the most important feature in determining customer satisfaction.
- Seat_Comfort_Excellent is the second most important feature.
- Customer_Type_Loyal Customer is the third most significant feature.
- Onboard_Entertainment_Good is the fourth most important feature.
- Seat_Comfort_Extremely Poor is, somewhat counterintuitively, the fifth most significant feature.
The features with the largest negative effect on overall experience are (in order of magnitude):
- Travel_Class_Eco
- Gender_Male
- Onboard_Entertainment_Needs Improvement
- Onboard_Entertainment_Poor
- Seat_Comfort_Needs Improvement
# Finding the odds
odds = np.exp(lg.coef_[0])
# Adding the odds to a dataframe and sorting the values
pd.DataFrame(odds, X_train.columns, columns=['odds']).sort_values(by='odds', ascending=False)
| odds | |
|---|---|
| Onboard_Entertainment_Excellent | 7.955561 |
| Seat_Comfort_Excellent | 4.505252 |
| Customer_Type_Loyal Customer | 3.091177 |
| Onboard_Entertainment_Good | 2.627009 |
| Seat_Comfort_Extremely Poor | 2.161531 |
| ... | ... |
| Seat_Comfort_Needs Improvement | 0.533306 |
| Onboard_Entertainment_Poor | 0.522168 |
| Onboard_Entertainment_Needs Improvement | 0.402386 |
| Gender_Male | 0.275056 |
| Travel_Class_Eco | 0.195813 |
73 rows × 1 columns
Observations:¶
Having converted the log odds into real odds, we can interpret the results as follows:
- The odds of satisfaction for a customer who rated Onboard Entertainment as Excellent are ~8 times the odds of a customer who did not.
- The odds for a customer who rated Seat Comfort as Excellent are ~4.5 times the odds of a customer who did not.
- The odds for a customer categorised as a Loyal Customer are ~3 times the odds of a customer not categorised as loyal.
- The odds for a customer who rated Onboard Entertainment as Good are ~2.6 times the odds of a customer who did not.
- The odds for a customer who rated Seat Comfort as Extremely Poor are ~2.2 times the odds of a customer who did not.
The features with the smallest odds ratios, i.e., the largest negative effect on overall experience, are (in order of magnitude):
- Travel_Class_Eco
- Gender_Male
- Onboard_Entertainment_Needs Improvement
- Onboard_Entertainment_Poor
- Seat_Comfort_Needs Improvement
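The conversion used above is exp() of each log-odds coefficient: a one-unit increase in a feature multiplies the odds of class 1 by exp(b). A quick check using two coefficients from the table above:

```python
import numpy as np

# Coefficient values taken from the logistic regression table above.
coefs = {
    'Onboard_Entertainment_Excellent': 2.073871,
    'Travel_Class_Eco': -1.630595,
}
# exp() maps log-odds to odds ratios: >1 raises the odds of
# satisfaction, <1 lowers them.
odds = {name: np.exp(b) for name, b in coefs.items()}
print(odds)
```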
Precision-Recall Curve¶
Next we will find the optimal threshold for the model using the Precision-Recall curve. The curve summarizes the trade-off between recall (the true positive rate) and precision (the positive predictive value) across different probability thresholds, so we can use it to search for a better threshold.
# Predict_proba gives the probability of each observation belonging to each class
y_scores_lg=lg.predict_proba(X_train)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(Y_train, y_scores_lg[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
Observation:¶
- We can see that precision and recall are balanced for a threshold of ~0.52.
# Calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds_lg)):
if precisions_lg[i]==recalls_lg[i]:
print(thresholds_lg[i])
0.5170822910653909
Observation:¶
- We can see that precision and recall are balanced for a threshold of 0.5170822910653909.
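Testing floats for exact equality, as in the loop above, can silently fail to find a match; a more robust variant (a sketch with toy arrays standing in for the `precisions_lg`/`recalls_lg`/`thresholds_lg` outputs of `precision_recall_curve`) picks the threshold minimising the precision-recall gap:

```python
import numpy as np

# Toy stand-ins; precision_recall_curve returns one more
# precision/recall value than thresholds, hence the [:-1] slices.
precisions = np.array([0.80, 0.85, 0.90, 0.95])
recalls    = np.array([0.95, 0.90, 0.84, 0.70])
thresholds = np.array([0.30, 0.45, 0.52])

# Index of the smallest absolute precision-recall gap.
best = np.argmin(np.abs(precisions[:-1] - recalls[:-1]))
print(thresholds[best])  # 0.45
```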
Let's find out the performance of the model at this threshold.
# Checking the performance of the model at the threshold
optimal_threshold=0.5170822910653909
Y_pred_train = lg.predict_proba(X_train)
metrics_score(Y_train, Y_pred_train[:,1]>optimal_threshold)
precision recall f1-score support
0 0.85 0.85 0.85 42786
1 0.87 0.87 0.87 51593
accuracy 0.86 94379
macro avg 0.86 0.86 0.86 94379
weighted avg 0.86 0.86 0.86 94379
Observations:
- After adjusting the logistic regression model to the optimal threshold of 0.5170822910653909, the accuracy of the model on the train dataset was unchanged at 0.86.
Let's predict on the test data.
optimal_threshold1=0.5170822910653909
Y_pred_test = lg.predict_proba(X_test)
Y_pred_test
array([[0.00468911, 0.99531089],
[0.5404763 , 0.4595237 ],
[0.02241624, 0.97758376],
...,
[0.90386436, 0.09613564],
[0.00496998, 0.99503002],
[0.96424242, 0.03575758]])
2) Support Vector Machines¶
# Scaling features to [-1, 1] to speed up SVM training
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)
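Note that the scaler is fitted on the training data only and then reused on the test data, so no test-set statistics leak into training. A minimal sketch of this pattern on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration: fit on train only, then transform both.
X_tr = np.array([[0.0], [5.0], [10.0]])
X_te = np.array([[2.5], [12.0]])

scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_tr)
out = scaler.transform(X_te)
# Test values outside the training range can map outside [-1, 1],
# which is expected: the scaler's min/max come from the train set.
print(out)  # [[-0.5], [1.4]]
```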
Let's build the models using two of the most widely used kernel functions:
- Linear Kernel
- RBF Kernel
2a) Linear Kernel SVM¶
# Fitting SVM
svm = SVC(kernel = 'linear') # Linear kernel or linear decision boundary
model = svm.fit(X = X_train_scaled, y = Y_train)
# Predicting on the train data
y_pred_train_svm = model.predict(X_train_scaled)
# Checking performance on the train data
metrics_score(Y_train, y_pred_train_svm)
precision recall f1-score support
0 0.89 0.90 0.89 42786
1 0.91 0.91 0.91 51593
accuracy 0.90 94379
macro avg 0.90 0.90 0.90 94379
weighted avg 0.90 0.90 0.90 94379
# Predicting on the test data
Y_pred_test_svm = model.predict(X_test_scaled)
# Checking performance on the test data
# metrics_score(Y_test, Y_pred_test_svm)
Y_pred_test_svm
array([1, 1, 1, ..., 0, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_1 = pd.DataFrame(Y_pred_test_svm)
Submission_1 = pd.concat([df_test['ID'],Prediction_1], axis=1)
Submission_1.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
# Submission_1.to_csv('Hackathon_Submission_1.csv', index = False)
# Checking the dataframe
Submission_1.head()
| ID | Overall_Experience | |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Observations:
- With an accuracy of 90%, the linear-kernel SVM underperforms our 95% benchmark.
2b) RBF Kernel¶
svm_rbf=SVC(kernel='rbf',probability=True)
svm_rbf.fit(X_train_scaled,Y_train)
y_scores_svm=svm_rbf.predict_proba(X_train_scaled) # Predict_proba gives the probability of each observation belonging to each class
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(Y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
# Calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds_svm)):
if precisions_svm[i]==recalls_svm[i]:
print(thresholds_svm[i])
0.4009583166478404
optimal_threshold1=0.4009583166478404
Y_pred_train = svm_rbf.predict_proba(X_train_scaled)
metrics_score(Y_train, Y_pred_train[:,1]>optimal_threshold1)
precision recall f1-score support
0 0.96 0.96 0.96 42786
1 0.96 0.96 0.96 51593
accuracy 0.96 94379
macro avg 0.96 0.96 0.96 94379
weighted avg 0.96 0.96 0.96 94379
Y_pred_test = svm_rbf.predict_proba(X_test_scaled)
# metrics_score(Y_test, Y_pred_test[:,1]>optimal_threshold1)
Y_pred_test
array([[4.06454001e-03, 9.95935460e-01],
[1.13842294e-02, 9.88615771e-01],
[7.69186587e-07, 9.99999231e-01],
...,
[5.80178477e-01, 4.19821523e-01],
[4.25033877e-03, 9.95749661e-01],
[9.19224339e-01, 8.07756608e-02]])
# Selecting the probability of the desired class
class_1_pred = Y_pred_test[:, 1]
class_1_pred
array([0.99593546, 0.98861577, 0.99999923, ..., 0.41982152, 0.99574966,
0.08077566])
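The class-1 probabilities can be turned into hard 0/1 labels with the tuned threshold rather than the default 0.5 cutoff; a sketch using a few of the probability values shown above:

```python
import numpy as np

# Probabilities taken from the predicted array above; the threshold
# is the one found from the precision-recall curve.
probs = np.array([0.9959, 0.9886, 0.4198, 0.0808])
threshold = 0.4009583166478404

# Compare each probability against the tuned threshold.
labels = (probs > threshold).astype(int)
print(labels)  # [1 1 1 0]
```

Note that 0.4198 becomes 1 under this threshold but would round to 0 under a plain 0.5 cutoff, which is exactly why a tuned threshold can matter.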
# Exporting the prediction on the test dataset as .csv
Prediction_2 = pd.DataFrame(class_1_pred)
Submission_2 = pd.concat([df_test['ID'],Prediction_2], axis=1)
Submission_2.columns=['ID', 'Overall_Experience']
# Rounding the 'Overall_Experience' probabilities to the nearest integer (0 or 1)
Submission_2['Overall_Experience'] = Submission_2['Overall_Experience'].round().astype(int)
# Saving the DataFrame as a CSV file
# Submission_2.to_csv('Hackathon_Submission_2.csv', index = False)
# Checking the dataframe
Submission_2.head()
| ID | Overall_Experience | |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Observations:
- At the optimal threshold of 0.4009583166478404, the SVM with RBF kernel achieved a higher accuracy on the train dataset (96%) than the linear kernel (90%).
3) Decision Tree¶
# Building decision tree model
model_dt= DecisionTreeClassifier(random_state=1,max_depth=8)
model_dt.fit(X_train, Y_train)
DecisionTreeClassifier(max_depth=8, random_state=1)
Let's check the model performance of decision tree
# Checking performance on the training dataset
pred_train_dt = model_dt.predict(X_train)
metrics_score(Y_train, pred_train_dt)
precision recall f1-score support
0 0.88 0.91 0.90 42786
1 0.92 0.90 0.91 51593
accuracy 0.90 94379
macro avg 0.90 0.90 0.90 94379
weighted avg 0.90 0.90 0.90 94379
Observation:
- The baseline Decision Tree model has an accuracy of 90% on the training set, below our target of 95%.
- The performance indicates, however, that the model is not overfitting the data, so we may get better accuracy by tuning the model's parameters.
Predicting on the test data and checking performance
pred_test_dt = model_dt.predict(X_test)
# metrics_score(y_test, pred_test_dt)
pred_test_dt
array([1, 1, 1, ..., 0, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_3 = pd.DataFrame(pred_test_dt)
Submission_3 = pd.concat([df_test['ID'],Prediction_3], axis=1)
Submission_3.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
#Submission_3.to_csv('Hackathon_Submission_3.csv', index = False)
# Checking the dataframe
Submission_3.head()
| ID | Overall_Experience | |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Let's visualize the decision tree and observe the decision rules:
features = list(X_train.columns)
plt.figure(figsize=(20,20))
from sklearn import tree
tree.plot_tree(model_dt,feature_names=features,max_depth =4, filled=True,fontsize=9,node_ids=True)
plt.show()
Observations:
The root node of the Decision Tree splits on Onboard_Entertainment_Excellent <= 0.50, the feature yielding the highest information gain. Node #1, Onboard_Entertainment_Good <= 0.50, is the second most influential split. By following the internal nodes along the appropriate branches, we can trace the model's decision-making down to the leaf nodes, which contain the tree's final decisions.
# Checking the weights of the decision tree
print(tree.export_text(model_dt, feature_names=X_train.columns.tolist(), show_weights=True))
|--- Onboard_Entertainment_Excellent <= 0.50 | |--- Onboard_Entertainment_Good <= 0.50 | | |--- Seat_Comfort_Extremely Poor <= 0.50 | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | |--- Seat_Comfort_Good <= 0.50 | | | | | |--- Travel_Class_Eco <= 0.50 | | | | | | |--- Ease_of_Online_Booking_Excellent <= 0.50 | | | | | | | |--- Ease_of_Online_Booking_Good <= 0.50 | | | | | | | | |--- weights: [7344.00, 625.00] class: 0 | | | | | | | |--- Ease_of_Online_Booking_Good > 0.50 | | | | | | | | |--- weights: [983.00, 811.00] class: 0 | | | | | | |--- Ease_of_Online_Booking_Excellent > 0.50 | | | | | | | |--- Customer_Type_Loyal Customer <= 0.50 | | | | | | | | |--- weights: [595.00, 20.00] class: 0 | | | | | | | |--- Customer_Type_Loyal Customer > 0.50 | | | | | | | | |--- weights: [186.00, 806.00] class: 1 | | | | | |--- Travel_Class_Eco > 0.50 | | | | | | |--- Travel_Distance <= 920.50 | | | | | | | |--- Type_Travel_Personal Travel <= 0.50 | | | | | | | | |--- weights: [697.00, 104.00] class: 0 | | | | | | | |--- Type_Travel_Personal Travel > 0.50 | | | | | | | | |--- weights: [48.00, 167.00] class: 1 | | | | | | |--- Travel_Distance > 920.50 | | | | | | | |--- Gender_Male <= 0.50 | | | | | | | | |--- weights: [5970.00, 600.00] class: 0 | | | | | | | |--- Gender_Male > 0.50 | | | | | | | | |--- weights: [15854.00, 306.00] class: 0 | | | | |--- Seat_Comfort_Good > 0.50 | | | | | |--- Arrival_Time_Convenient_Good <= 0.50 | | | | | | |--- Travel_Class_Eco <= 0.50 | | | | | | | |--- Customer_Type_Loyal Customer <= 0.50 | | | | | | | | |--- weights: [25.00, 56.00] class: 1 | | | | | | | |--- Customer_Type_Loyal Customer > 0.50 | | | | | | | | |--- weights: [463.00, 77.00] class: 0 | | | | | | |--- Travel_Class_Eco > 0.50 | | | | | | | |--- Baggage_Handling_Good <= 0.50 | | | | | | | | |--- weights: [346.00, 192.00] class: 0 | | | | | | | |--- Baggage_Handling_Good > 0.50 | | | | | | | | |--- weights: [236.00, 427.00] class: 1 | | | | | |--- Arrival_Time_Convenient_Good > 
0.50 | | | | | | |--- Ease_of_Online_Booking_Poor <= 0.50 | | | | | | | |--- Platform_Location_Manageable <= 0.50 | | | | | | | | |--- weights: [299.00, 1146.00] class: 1 | | | | | | | |--- Platform_Location_Manageable > 0.50 | | | | | | | | |--- weights: [69.00, 49.00] class: 0 | | | | | | |--- Ease_of_Online_Booking_Poor > 0.50 | | | | | | | |--- Age <= 35.50 | | | | | | | | |--- weights: [14.00, 11.00] class: 0 | | | | | | | |--- Age > 35.50 | | | | | | | | |--- weights: [43.00, 10.00] class: 0 | | | |--- Seat_Comfort_Excellent > 0.50 | | | | |--- Legroom_Good <= 0.50 | | | | | |--- Onboard_Service_Poor <= 0.50 | | | | | | |--- weights: [0.00, 1283.00] class: 1 | | | | | |--- Onboard_Service_Poor > 0.50 | | | | | | |--- Catering_Good <= 0.50 | | | | | | | |--- weights: [0.00, 47.00] class: 1 | | | | | | |--- Catering_Good > 0.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- Legroom_Good > 0.50 | | | | | |--- Ease_of_Online_Booking_Excellent <= 0.50 | | | | | | |--- weights: [0.00, 402.00] class: 1 | | | | | |--- Ease_of_Online_Booking_Excellent > 0.50 | | | | | | |--- Catering_Excellent <= 0.50 | | | | | | | |--- Travel_Class_Eco <= 0.50 | | | | | | | | |--- weights: [24.00, 7.00] class: 0 | | | | | | | |--- Travel_Class_Eco > 0.50 | | | | | | | | |--- weights: [0.00, 9.00] class: 1 | | | | | | |--- Catering_Excellent > 0.50 | | | | | | | |--- Online_Boarding_Good <= 0.50 | | | | | | | | |--- weights: [1.00, 24.00] class: 1 | | | | | | | |--- Online_Boarding_Good > 0.50 | | | | | | | | |--- weights: [2.00, 4.00] class: 1 | | |--- Seat_Comfort_Extremely Poor > 0.50 | | | |--- Online_Boarding_Extremely Poor <= 0.50 | | | | |--- weights: [0.00, 1880.00] class: 1 | | | |--- Online_Boarding_Extremely Poor > 0.50 | | | | |--- weights: [8.00, 0.00] class: 0 | |--- Onboard_Entertainment_Good > 0.50 | | |--- Catering_Good <= 0.50 | | | |--- Ease_of_Online_Booking_Needs Improvement <= 0.50 | | | | |--- Ease_of_Online_Booking_Poor <= 0.50 | | | | | |--- 
Seat_Comfort_Good <= 0.50 | | | | | | |--- Online_Boarding_Poor <= 0.50 | | | | | | | |--- Online_Boarding_Needs Improvement <= 0.50 | | | | | | | | |--- weights: [726.00, 12009.00] class: 1 | | | | | | | |--- Online_Boarding_Needs Improvement > 0.50 | | | | | | | | |--- weights: [185.00, 237.00] class: 1 | | | | | | |--- Online_Boarding_Poor > 0.50 | | | | | | | |--- Legroom_Excellent <= 0.50 | | | | | | | | |--- weights: [199.00, 102.00] class: 0 | | | | | | | |--- Legroom_Excellent > 0.50 | | | | | | | | |--- weights: [1.00, 115.00] class: 1 | | | | | |--- Seat_Comfort_Good > 0.50 | | | | | | |--- CheckIn_Service_Excellent <= 0.50 | | | | | | | |--- Cleanliness_Excellent <= 0.50 | | | | | | | | |--- weights: [1041.00, 1450.00] class: 1 | | | | | | | |--- Cleanliness_Excellent > 0.50 | | | | | | | | |--- weights: [48.00, 472.00] class: 1 | | | | | | |--- CheckIn_Service_Excellent > 0.50 | | | | | | | |--- Customer_Type_Loyal Customer <= 0.50 | | | | | | | | |--- weights: [16.00, 23.00] class: 1 | | | | | | | |--- Customer_Type_Loyal Customer > 0.50 | | | | | | | | |--- weights: [34.00, 571.00] class: 1 | | | | |--- Ease_of_Online_Booking_Poor > 0.50 | | | | | |--- Seat_Comfort_Extremely Poor <= 0.50 | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | |--- Legroom_Poor <= 0.50 | | | | | | | | |--- weights: [218.00, 107.00] class: 0 | | | | | | | |--- Legroom_Poor > 0.50 | | | | | | | | |--- weights: [338.00, 6.00] class: 0 | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | |--- weights: [0.00, 54.00] class: 1 | | | | | |--- Seat_Comfort_Extremely Poor > 0.50 | | | | | | |--- weights: [0.00, 140.00] class: 1 | | | |--- Ease_of_Online_Booking_Needs Improvement > 0.50 | | | | |--- Seat_Comfort_Needs Improvement <= 0.50 | | | | | |--- Baggage_Handling_Needs Improvement <= 0.50 | | | | | | |--- Seat_Comfort_Extremely Poor <= 0.50 | | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | | |--- weights: [179.00, 82.00] class: 0 | | | | | | | 
|--- Seat_Comfort_Excellent > 0.50 | | | | | | | | |--- weights: [0.00, 20.00] class: 1 | | | | | | |--- Seat_Comfort_Extremely Poor > 0.50 | | | | | | | |--- weights: [0.00, 30.00] class: 1 | | | | | |--- Baggage_Handling_Needs Improvement > 0.50 | | | | | | |--- Legroom_Needs Improvement <= 0.50 | | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | | |--- weights: [85.00, 55.00] class: 0 | | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | | |--- weights: [0.00, 38.00] class: 1 | | | | | | |--- Legroom_Needs Improvement > 0.50 | | | | | | | |--- Cleanliness_Poor <= 0.50 | | | | | | | | |--- weights: [6.00, 434.00] class: 1 | | | | | | | |--- Cleanliness_Poor > 0.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | |--- Seat_Comfort_Needs Improvement > 0.50 | | | | | |--- Online_Support_Excellent <= 0.50 | | | | | | |--- Online_Boarding_Excellent <= 0.50 | | | | | | | |--- CheckIn_Service_Excellent <= 0.50 | | | | | | | | |--- weights: [630.00, 48.00] class: 0 | | | | | | | |--- CheckIn_Service_Excellent > 0.50 | | | | | | | | |--- weights: [1.00, 19.00] class: 1 | | | | | | |--- Online_Boarding_Excellent > 0.50 | | | | | | | |--- Arrival_Time_Convenient_Excellent <= 0.50 | | | | | | | | |--- weights: [0.00, 27.00] class: 1 | | | | | | | |--- Arrival_Time_Convenient_Excellent > 0.50 | | | | | | | | |--- weights: [1.00, 1.00] class: 0 | | | | | |--- Online_Support_Excellent > 0.50 | | | | | | |--- Travel_Distance <= 204.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Travel_Distance > 204.50 | | | | | | | |--- Arrival_Time_Convenient_Excellent <= 0.50 | | | | | | | | |--- weights: [0.00, 56.00] class: 1 | | | | | | | |--- Arrival_Time_Convenient_Excellent > 0.50 | | | | | | | | |--- weights: [1.00, 3.00] class: 1 | | |--- Catering_Good > 0.50 | | | |--- Travel_Class_Eco <= 0.50 | | | | |--- Seat_Comfort_Good <= 0.50 | | | | | |--- Arrival_Time_Convenient_Good <= 0.50 | | | | | | |--- Ease_of_Online_Booking_Poor <= 
0.50 | | | | | | | |--- Ease_of_Online_Booking_Good <= 0.50 | | | | | | | | |--- weights: [51.00, 143.00] class: 1 | | | | | | | |--- Ease_of_Online_Booking_Good > 0.50 | | | | | | | | |--- weights: [6.00, 171.00] class: 1 | | | | | | |--- Ease_of_Online_Booking_Poor > 0.50 | | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Arrival_Time_Convenient_Good > 0.50 | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | |--- Online_Support_Excellent <= 0.50 | | | | | | | | |--- weights: [305.00, 14.00] class: 0 | | | | | | | |--- Online_Support_Excellent > 0.50 | | | | | | | | |--- weights: [6.00, 12.00] class: 1 | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | |--- Travel_Distance <= 3736.50 | | | | | | | | |--- weights: [1.00, 41.00] class: 1 | | | | | | | |--- Travel_Distance > 3736.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | |--- Seat_Comfort_Good > 0.50 | | | | | |--- Customer_Type_Loyal Customer <= 0.50 | | | | | | |--- Age <= 24.50 | | | | | | | |--- Onboard_Service_Poor <= 0.50 | | | | | | | | |--- weights: [17.00, 286.00] class: 1 | | | | | | | |--- Onboard_Service_Poor > 0.50 | | | | | | | | |--- weights: [6.00, 4.00] class: 0 | | | | | | |--- Age > 24.50 | | | | | | | |--- Age <= 30.50 | | | | | | | | |--- weights: [116.00, 223.00] class: 1 | | | | | | | |--- Age > 30.50 | | | | | | | | |--- weights: [251.00, 235.00] class: 0 | | | | | |--- Customer_Type_Loyal Customer > 0.50 | | | | | | |--- Platform_Location_Manageable <= 0.50 | | | | | | | |--- Type_Travel_Personal Travel <= 0.50 | | | | | | | | |--- weights: [163.00, 2026.00] class: 1 | | | | | | | |--- Type_Travel_Personal Travel > 0.50 | | | | | | | | |--- weights: [56.00, 87.00] class: 1 | | | | | | |--- Platform_Location_Manageable > 0.50 | | | | | | | |--- Arrival_Delay_in_Mins <= 0.50 | | | | | | | 
| |--- weights: [27.00, 23.00] class: 0 | | | | | | | |--- Arrival_Delay_in_Mins > 0.50 | | | | | | | | |--- weights: [30.00, 9.00] class: 0 | | | |--- Travel_Class_Eco > 0.50 | | | | |--- Ease_of_Online_Booking_Good <= 0.50 | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | |--- Seat_Comfort_Good <= 0.50 | | | | | | | |--- Gender_Male <= 0.50 | | | | | | | | |--- weights: [209.00, 48.00] class: 0 | | | | | | | |--- Gender_Male > 0.50 | | | | | | | | |--- weights: [284.00, 6.00] class: 0 | | | | | | |--- Seat_Comfort_Good > 0.50 | | | | | | | |--- Gender_Male <= 0.50 | | | | | | | | |--- weights: [1117.00, 844.00] class: 0 | | | | | | | |--- Gender_Male > 0.50 | | | | | | | | |--- weights: [1336.00, 502.00] class: 0 | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | |--- Travel_Distance <= 2890.50 | | | | | | | |--- Cleanliness_Good <= 0.50 | | | | | | | | |--- weights: [0.00, 88.00] class: 1 | | | | | | | |--- Cleanliness_Good > 0.50 | | | | | | | | |--- weights: [2.00, 25.00] class: 1 | | | | | | |--- Travel_Distance > 2890.50 | | | | | | | |--- Arrival_Delay_in_Mins <= 28.00 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Arrival_Delay_in_Mins > 28.00 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | |--- Ease_of_Online_Booking_Good > 0.50 | | | | | |--- Platform_Location_Manageable <= 0.50 | | | | | | |--- Arrival_Time_Convenient_Good <= 0.50 | | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | | |--- weights: [274.00, 210.00] class: 0 | | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | | |--- weights: [0.00, 20.00] class: 1 | | | | | | |--- Arrival_Time_Convenient_Good > 0.50 | | | | | | | |--- Customer_Type_Loyal Customer <= 0.50 | | | | | | | | |--- weights: [63.00, 41.00] class: 0 | | | | | | | |--- Customer_Type_Loyal Customer > 0.50 | | | | | | | | |--- weights: [238.00, 637.00] class: 1 | | | | | |--- Platform_Location_Manageable > 0.50 | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | 
| | | | | |--- Age <= 16.50 | | | | | | | | |--- weights: [47.00, 4.00] class: 0 | | | | | | | |--- Age > 16.50 | | | | | | | | |--- weights: [223.00, 103.00] class: 0 | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | |--- weights: [0.00, 4.00] class: 1 |--- Onboard_Entertainment_Excellent > 0.50 | |--- Type_Travel_Personal Travel <= 0.50 | | |--- Customer_Type_Loyal Customer <= 0.50 | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | |--- Travel_Class_Eco <= 0.50 | | | | | |--- Age <= 31.00 | | | | | | |--- Travel_Distance <= 3174.00 | | | | | | | |--- Departure_Delay_in_Mins <= 165.00 | | | | | | | | |--- weights: [7.00, 60.00] class: 1 | | | | | | | |--- Departure_Delay_in_Mins > 165.00 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- Travel_Distance > 3174.00 | | | | | | | |--- Platform_Location_Manageable <= 0.50 | | | | | | | | |--- weights: [6.00, 0.00] class: 0 | | | | | | | |--- Platform_Location_Manageable > 0.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- Age > 31.00 | | | | | | |--- Online_Support_Good <= 0.50 | | | | | | | |--- Legroom_Needs Improvement <= 0.50 | | | | | | | | |--- weights: [37.00, 7.00] class: 0 | | | | | | | |--- Legroom_Needs Improvement > 0.50 | | | | | | | | |--- weights: [2.00, 4.00] class: 1 | | | | | | |--- Online_Support_Good > 0.50 | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- Travel_Class_Eco > 0.50 | | | | | |--- Cleanliness_Excellent <= 0.50 | | | | | | |--- Baggage_Handling_Excellent <= 0.50 | | | | | | | |--- Platform_Location_Inconvenient <= 0.50 | | | | | | | | |--- weights: [110.00, 7.00] class: 0 | | | | | | | |--- Platform_Location_Inconvenient > 0.50 | | | | | | | | |--- weights: [7.00, 5.00] class: 0 | | | | | | |--- Baggage_Handling_Excellent > 0.50 | | | | | | | |--- Platform_Location_Manageable <= 0.50 | | | | | | | | |--- weights: [5.00, 9.00] class: 1 | | | | | | | |--- Platform_Location_Manageable > 0.50 | | | | | | | | |--- weights: [5.00, 
0.00] class: 0 | | | | | |--- Cleanliness_Excellent > 0.50 | | | | | | |--- Onboard_Wifi_Service_Good <= 0.50 | | | | | | | |--- Legroom_Poor <= 0.50 | | | | | | | | |--- weights: [8.00, 4.00] class: 0 | | | | | | | |--- Legroom_Poor > 0.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- Onboard_Wifi_Service_Good > 0.50 | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | |--- Seat_Comfort_Excellent > 0.50 | | | | |--- weights: [0.00, 1318.00] class: 1 | | |--- Customer_Type_Loyal Customer > 0.50 | | | |--- Ease_of_Online_Booking_Poor <= 0.50 | | | | |--- Legroom_Extremely Poor <= 0.50 | | | | | |--- Travel_Class_Eco <= 0.50 | | | | | | |--- Legroom_Poor <= 0.50 | | | | | | | |--- Age <= 60.50 | | | | | | | | |--- weights: [11.00, 11450.00] class: 1 | | | | | | | |--- Age > 60.50 | | | | | | | | |--- weights: [3.00, 235.00] class: 1 | | | | | | |--- Legroom_Poor > 0.50 | | | | | | | |--- Travel_Distance <= 915.50 | | | | | | | | |--- weights: [4.00, 8.00] class: 1 | | | | | | | |--- Travel_Distance > 915.50 | | | | | | | | |--- weights: [0.00, 176.00] class: 1 | | | | | |--- Travel_Class_Eco > 0.50 | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | |--- Departure_Delay_in_Mins <= 129.50 | | | | | | | | |--- weights: [45.00, 584.00] class: 1 | | | | | | | |--- Departure_Delay_in_Mins > 129.50 | | | | | | | | |--- weights: [10.00, 3.00] class: 0 | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | |--- weights: [0.00, 1825.00] class: 1 | | | | |--- Legroom_Extremely Poor > 0.50 | | | | | |--- weights: [1.00, 0.00] class: 0 | | | |--- Ease_of_Online_Booking_Poor > 0.50 | | | | |--- Travel_Class_Eco <= 0.50 | | | | | |--- Travel_Distance <= 272.00 | | | | | | |--- Seat_Comfort_Extremely Poor <= 0.50 | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- Seat_Comfort_Extremely Poor > 0.50 | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | |--- Travel_Distance > 272.00 | | | | | | |--- weights: [0.00, 
135.00] class: 1 | | | | |--- Travel_Class_Eco > 0.50 | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | |--- Seat_Comfort_Extremely Poor <= 0.50 | | | | | | | |--- Onboard_Wifi_Service_Good <= 0.50 | | | | | | | | |--- weights: [15.00, 2.00] class: 0 | | | | | | | |--- Onboard_Wifi_Service_Good > 0.50 | | | | | | | | |--- weights: [2.00, 6.00] class: 1 | | | | | | |--- Seat_Comfort_Extremely Poor > 0.50 | | | | | | | |--- weights: [0.00, 7.00] class: 1 | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | |--- Onboard_Wifi_Service_Good <= 0.50 | | | | | | | |--- weights: [0.00, 42.00] class: 1 | | | | | | |--- Onboard_Wifi_Service_Good > 0.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | |--- Type_Travel_Personal Travel > 0.50 | | |--- Gender_Male <= 0.50 | | | |--- Arrival_Delay_in_Mins <= 131.00 | | | | |--- Seat_Comfort_Good <= 0.50 | | | | | |--- Arrival_Time_Convenient_Excellent <= 0.50 | | | | | | |--- Travel_Distance <= 5727.00 | | | | | | | |--- Departure_Delay_in_Mins <= 128.50 | | | | | | | | |--- weights: [11.00, 2584.00] class: 1 | | | | | | | |--- Departure_Delay_in_Mins > 128.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Travel_Distance > 5727.00 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- Arrival_Time_Convenient_Excellent > 0.50 | | | | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | | | | |--- Seat_Comfort_Extremely Poor <= 0.50 | | | | | | | | |--- weights: [72.00, 25.00] class: 0 | | | | | | | |--- Seat_Comfort_Extremely Poor > 0.50 | | | | | | | | |--- weights: [0.00, 48.00] class: 1 | | | | | | |--- Seat_Comfort_Excellent > 0.50 | | | | | | | |--- weights: [0.00, 739.00] class: 1 | | | | |--- Seat_Comfort_Good > 0.50 | | | | | |--- Arrival_Time_Convenient_Excellent <= 0.50 | | | | | | |--- Platform_Location_Manageable <= 0.50 | | | | | | | |--- Platform_Location_Inconvenient <= 0.50 | | | | | | | | |--- weights: [58.00, 652.00] class: 1 | | | | | | | |--- 
Platform_Location_Inconvenient > 0.50 | | | | | | | | |--- weights: [29.00, 42.00] class: 1 | | | | | | |--- Platform_Location_Manageable > 0.50 | | | | | | | |--- Departure_Delay_in_Mins <= 49.50 | | | | | | | | |--- weights: [36.00, 44.00] class: 1 | | | | | | | |--- Departure_Delay_in_Mins > 49.50 | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | |--- Arrival_Time_Convenient_Excellent > 0.50 | | | | | | |--- Platform_Location_Very Convenient <= 0.50 | | | | | | | |--- Travel_Distance <= 960.00 | | | | | | | | |--- weights: [48.00, 23.00] class: 0 | | | | | | | |--- Travel_Distance > 960.00 | | | | | | | | |--- weights: [27.00, 35.00] class: 1 | | | | | | |--- Platform_Location_Very Convenient > 0.50 | | | | | | | |--- Online_Support_Excellent <= 0.50 | | | | | | | | |--- weights: [1.00, 18.00] class: 1 | | | | | | | |--- Online_Support_Excellent > 0.50 | | | | | | | | |--- weights: [6.00, 5.00] class: 0 | | | |--- Arrival_Delay_in_Mins > 131.00 | | | | |--- weights: [57.00, 0.00] class: 0 | | |--- Gender_Male > 0.50 | | | |--- Seat_Comfort_Excellent <= 0.50 | | | | |--- Seat_Comfort_Extremely Poor <= 0.50 | | | | | |--- Seat_Comfort_Good <= 0.50 | | | | | | |--- weights: [284.00, 0.00] class: 0 | | | | | |--- Seat_Comfort_Good > 0.50 | | | | | | |--- Onboard_Wifi_Service_Good <= 0.50 | | | | | | | |--- Arrival_Delay_in_Mins <= 5.50 | | | | | | | | |--- weights: [37.00, 27.00] class: 0 | | | | | | | |--- Arrival_Delay_in_Mins > 5.50 | | | | | | | | |--- weights: [25.00, 3.00] class: 0 | | | | | | |--- Onboard_Wifi_Service_Good > 0.50 | | | | | | | |--- Travel_Distance <= 5287.00 | | | | | | | | |--- weights: [40.00, 3.00] class: 0 | | | | | | | |--- Travel_Distance > 5287.00 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Seat_Comfort_Extremely Poor > 0.50 | | | | | |--- weights: [0.00, 7.00] class: 1 | | | |--- Seat_Comfort_Excellent > 0.50 | | | | |--- weights: [0.00, 462.00] class: 1
# Importance of features in the tree building
feature_names = list(X_train.columns)
importances = model_dt.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 15))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observation:
The baseline model finds the following five features to be the most important influencers of overall experience:
- Onboard_Entertainment_Excellent
- Onboard_Entertainment_Good
- Seat_Comfort_Excellent
- Seat_Comfort_Extremely_Poor
- Seat_Comfort_Good
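These rankings can also be extracted programmatically rather than read off the chart. A minimal sketch on synthetic stand-in data (in the notebook itself, `model_dt` and `feature_names` from the cells above would be used):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the notebook's X_train / model_dt (assumption: same API)
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]
model = DecisionTreeClassifier(random_state=1).fit(X, y)

# argsort returns ascending order, so reverse and take the first five
top5_idx = np.argsort(model.feature_importances_)[::-1][:5]
top5 = [(feature_names[i], model.feature_importances_[i]) for i in top5_idx]
for name, imp in top5:
    print(f"{name}: {imp:.3f}")
```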
Motivation for tuning the hyperparameters using GridSearchCV:
To see whether model performance can be improved, we will tune its hyperparameters with GridSearchCV. The algorithm searches the supplied grid for the hyperparameter values (e.g., tree depth, minimum samples per leaf) that best improve generalization, i.e., predictive power on unseen data, while limiting overfitting.
What about pruning the tree?
Since the Decision Tree is not overfitting the training dataset, pruning offers little benefit here: removing nodes discards information, and the likely effect would be a reduction in the quality of the classifier.
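For completeness, if one did want to verify that no pruning level helps, scikit-learn exposes minimal cost-complexity pruning via `ccp_alpha`. A sketch on synthetic stand-in data (the notebook's train/validation split would be used in practice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

# cost_complexity_pruning_path gives the effective alphas for this tree
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)

# Refit at each alpha and keep the validation score; alpha=0 is the unpruned tree
scores = {}
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_tr, y_tr)
    scores[alpha] = tree.score(X_val, y_val)

best_alpha = max(scores, key=scores.get)
print(f"best ccp_alpha: {best_alpha:.5f}, val accuracy: {scores[best_alpha]:.3f}")
```

If the best validation score occurs at `ccp_alpha = 0`, that confirms the unpruned tree is preferable, matching the reasoning above.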
# Choosing the type of classifier
dtree_estimator = DecisionTreeClassifier(class_weight='balanced', random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 7),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [5, 10, 20, 25]
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(accuracy_score)
# Run the grid search
gridCV = GridSearchCV(dtree_estimator, parameters, scoring = scorer, cv = 5)
# Fitting the grid search on the train data
gridCV = gridCV.fit(X_train, Y_train)
# Set the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_
# Fit the best estimator to the data
dtree_estimator.fit(X_train, Y_train)
DecisionTreeClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=6, min_samples_leaf=5, random_state=1)
# Checking performance on the training dataset
y_train_pred_dt = dtree_estimator.predict(X_train)
metrics_score(Y_train, y_train_pred_dt)
precision recall f1-score support
0 0.82 0.93 0.88 42786
1 0.94 0.84 0.88 51593
accuracy 0.88 94379
macro avg 0.88 0.88 0.88 94379
weighted avg 0.89 0.88 0.88 94379
# Checking performance on the test dataset
y_test_pred_dt = dtree_estimator.predict(X_test)
# metrics_score(y_test, y_test_pred_dt)
y_test_pred_dt
array([1, 1, 1, ..., 0, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_4 = pd.DataFrame(y_test_pred_dt)
Submission_4 = pd.concat([df_test['ID'], Prediction_4], axis=1)
Submission_4.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
# Submission_4.to_csv('Hackathon_Submission_4.csv', index = False)
# Checking the dataframe
Submission_4.head()
|   | ID | Overall_Experience |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Observations:
Compared to the model with the default hyperparameter values, tuning has actually reduced accuracy by 0.02. We will therefore try other ML models in search of better performance.
4) Random Forest¶
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier(random_state = 1)
rf_estimator.fit(X_train, Y_train)
RandomForestClassifier(random_state=1)
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(Y_train, y_pred_train_rf)
precision recall f1-score support
0 1.00 1.00 1.00 42786
1 1.00 1.00 1.00 51593
accuracy 1.00 94379
macro avg 1.00 1.00 1.00 94379
weighted avg 1.00 1.00 1.00 94379
Observation:
The Random Forest scores 100% on every metric on the training dataset. This indicates that the model is overfitting the training data.
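Since the hackathon test set carries no labels here, one way to estimate generalization from the training data alone is the out-of-bag (OOB) score. A sketch on synthetic stand-in data (in the notebook, `X_train`/`Y_train` would be passed instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# oob_score=True scores each tree on the samples left out of its bootstrap,
# giving a generalization estimate without needing labeled test data
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rf.fit(X, y)

print(f"train accuracy: {rf.score(X, y):.3f}")
print(f"OOB accuracy:   {rf.oob_score_:.3f}")
```

A large gap between the two numbers is the same overfitting signal seen in the classification report above.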
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(X_test)
# metrics_score(y_test, y_pred_test_rf)
y_pred_test_rf
array([1, 1, 1, ..., 0, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_5 = pd.DataFrame(y_pred_test_rf)
Submission_5 = pd.concat([df_test['ID'], Prediction_5], axis=1)
Submission_5.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
# Submission_5.to_csv('Hackathon_Submission_5.csv', index = False)
# Checking the dataframe
Submission_5.head()
|   | ID | Overall_Experience |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Let's check the feature importances of the Random Forest
importances = rf_estimator.feature_importances_
columns = X_train.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (10, 15))
sns.barplot(x = importance_df.Importance, y = importance_df.index)
<Axes: xlabel='Importance'>
Observations:
The Random Forest finds the following features to be the five most important in determining overall experience:
- Onboard_Entertainment_Excellent
- Seat_Comfort_Excellent
- Onboard_Entertainment_Good
- Travel_Class_Eco
- Customer_Type_Loyal_Customer
The interpretation is that passengers predicted (and observed) to rate the overall experience as 1 are those who rate these top features highly. It may be instructive to reduce the training dataset to the most important features (e.g., the top 20-30) to make training more efficient; however, this risks information loss, so there may be undesirable costs. Alternatively, we may reduce the dimensionality of the data using principal component analysis (PCA) to assist with efficiency, though PCA is likely to make the model results harder to interpret.
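For illustration, the PCA route mentioned above could look like the following sketch (synthetic stand-in data; passing `n_components` as a fraction keeps the smallest number of components explaining at least that share of the variance):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           random_state=1)

# Scale first: PCA directions are driven by variance,
# so unscaled features would dominate the components
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain at least 95% of the variance
pca = PCA(n_components=0.95, random_state=1)
X_reduced = pca.fit_transform(X_scaled)

print(f"{X.shape[1]} features -> {X_reduced.shape[1]} components")
```

The trade-off noted above applies: the components are linear mixtures of the one-hot survey features, so per-feature importance charts like the ones in this notebook are no longer available.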
Since our project goal is to achieve the highest possible accuracy, we will tune the Random Forest hyperparameters and see whether the overfitting seen in the base model is corrected.
# Specifying an alternative Random Forest with hyperparameters narrowed down via random search
alt_rf_estimator = RandomForestClassifier(n_estimators=220, max_depth=20, max_features=.75, random_state=100)
alt_rf_estimator.fit(X_train, Y_train)
RandomForestClassifier(max_depth=20, max_features=0.75, n_estimators=220,
                       random_state=100)
# Checking performance on the training data
y_pred_train_rf_alt = alt_rf_estimator.predict(X_train)
metrics_score(Y_train, y_pred_train_rf_alt)
precision recall f1-score support
0 0.99 1.00 0.99 42786
1 1.00 0.99 0.99 51593
accuracy 0.99 94379
macro avg 0.99 0.99 0.99 94379
weighted avg 0.99 0.99 0.99 94379
# Checking performance on the testing data
y_pred_test_rf_alt = alt_rf_estimator.predict(X_test)
# metrics_score(y_test, y_pred_test_rf_alt)
y_pred_test_rf_alt
array([1, 1, 1, ..., 1, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_6 = pd.DataFrame(y_pred_test_rf_alt)
Submission_6 = pd.concat([df_test['ID'], Prediction_6], axis=1)
Submission_6.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
# Submission_6.to_csv('Hackathon_Submission_6.csv', index = False)
# Checking the dataframe
Submission_6.head()
|   | ID | Overall_Experience |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Observations:
Training accuracy is at 99%, which still suggests some overfitting, but it is marginally less extreme than the base model's perfect training score. We will re-check the feature importances using the adjusted Random Forest model and compare the results.
importances = alt_rf_estimator.feature_importances_
columns = X_train.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (10, 15))
sns.barplot(x = importance_df.Importance, y = importance_df.index)
<Axes: xlabel='Importance'>
Observations:
The base Random Forest found the following features, in order of importance, as determining overall experience:
- Onboard_Entertainment_Excellent
- Seat_Comfort_Excellent
- Onboard_Entertainment_Good
- Travel_Class_Eco
- Customer_Type_Loyal_Customer
The alternative Random Forest, with the hyperparameters set as above, has found the following order instead:
- Onboard_Entertainment_Excellent
- Onboard_Entertainment_Good
- Seat_Comfort_Excellent
- Seat_Comfort_Extremely_Poor
- Seat_Comfort_Good
The shift in feature importances shows that, even without reducing the training dataset to the most important features or reducing its dimensionality, the Random Forest performs better (i.e., overfits less) once its hyperparameters are adjusted.
So far we have used random search to narrow the range for each hyperparameter. To find the best specific combination of settings, we will use GridSearchCV, which tries every combination in the grid we define rather than sampling randomly from a distribution.
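The two search strategies can be contrasted in a few lines. A sketch on synthetic stand-in data (the parameter ranges are illustrative, not the notebook's actual grid):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
rf = RandomForestClassifier(random_state=1)

# Random search: n_iter draws from distributions, a cheap way to narrow ranges
rand = RandomizedSearchCV(rf, {"max_depth": randint(2, 20),
                               "n_estimators": randint(50, 300)},
                          n_iter=5, cv=3, random_state=1)
rand.fit(X, y)

# Grid search: exhaustively tries every combination in the narrowed grid
grid = GridSearchCV(rf, {"max_depth": [rand.best_params_["max_depth"]],
                         "n_estimators": [100, 200]}, cv=3)
grid.fit(X, y)
print(rand.best_params_, grid.best_params_)
```

Random search covers wide ranges cheaply; grid search then pays the exhaustive cost only over the narrowed region.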
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(class_weight = 'balanced', random_state = 20)
# Grid of parameters to choose from
params_rf = {
"n_estimators": [200, 300],
"min_samples_leaf": np.arange(1, 4, 1),
"max_features": [0.8, 0.9, 'auto'],
}
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, Y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fitting the tuned model
rf_estimator_tuned.fit(X_train, Y_train)
RandomForestClassifier(class_weight='balanced', max_features=0.8,
                       n_estimators=300, random_state=20)
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)
metrics_score(Y_train, y_pred_train_rf_tuned)
precision recall f1-score support
0 1.00 1.00 1.00 42786
1 1.00 1.00 1.00 51593
accuracy 1.00 94379
macro avg 1.00 1.00 1.00 94379
weighted avg 1.00 1.00 1.00 94379
Observations:
The tuned Random Forest model has returned 100% on all the training metrics. The model is clearly overfitting the training data.
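With training accuracy saturating at 100%, a more honest proxy for performance on unseen data is k-fold cross-validation. A sketch on synthetic stand-in data (in the notebook, `rf_estimator_tuned` and `X_train`/`Y_train` would be used):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1)

# 5-fold CV: each fold is scored by a model that never saw it, so the mean
# is a far better proxy for unseen data than the training accuracy
cv_scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```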
# Checking performance on the testing data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)
# metrics_score(y_test, y_pred_test_rf_tuned)
y_pred_test_rf_tuned
array([1, 1, 1, ..., 1, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_6 = pd.DataFrame(y_pred_test_rf_tuned)
Submission_6 = pd.concat([df_test['ID'], Prediction_6], axis=1)
Submission_6.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
# Submission_6.to_csv('Hackathon_Submission_6.csv', index = False)
# Checking the dataframe
Submission_6.head()
|   | ID | Overall_Experience |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
5) AdaBoost¶
from sklearn.ensemble import AdaBoostClassifier
# Initializing the AdaBoostClassifier with the tuned Decision Tree as its base estimator
# (note: base_estimator was renamed to estimator in scikit-learn >= 1.2)
ada = AdaBoostClassifier(base_estimator=dtree_estimator, n_estimators=50, learning_rate=1.0, random_state=42)
# Fit the model to the training data
ada.fit(X_train, Y_train)
# Predict on the train set
y_pred_ada_train = ada.predict(X_train)
# Evaluate the model
metrics_score(Y_train, y_pred_ada_train)
precision recall f1-score support
0 0.97 0.98 0.97 42786
1 0.98 0.97 0.98 51593
accuracy 0.97 94379
macro avg 0.97 0.98 0.97 94379
weighted avg 0.97 0.97 0.97 94379
Observations: The AdaBoost model returns an accuracy score of 0.97 on the training data. This is promising, although training accuracy can overstate performance on unseen data, so clearing the 95% benchmark is not guaranteed. Let us run the prediction on the test data.
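As an aside, the boosting curve itself shows how many rounds actually help. A sketch using `staged_score` on synthetic stand-in data (the notebook's train/validation data would be used in practice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

ada_demo = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# staged_score yields the validation accuracy after each boosting round,
# revealing where adding more weak learners stops paying off
val_curve = list(ada_demo.staged_score(X_val, y_val))
best_round = int(np.argmax(val_curve)) + 1
print(f"best validation accuracy {max(val_curve):.3f} at round {best_round}")
```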
# Predict on the test set
y_pred_ada_test = ada.predict(X_test)
y_pred_ada_test
array([1, 1, 1, ..., 1, 1, 0], dtype=int64)
# Exporting the prediction on the test dataset as .csv
Prediction_7 = pd.DataFrame(y_pred_ada_test)
Submission_7 = pd.concat([df_test['ID'], Prediction_7], axis=1)
Submission_7.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
# Submission_7.to_csv('Hackathon_Submission_7.csv', index = False)
# Checking the dataframe
Submission_7.head()
|   | ID | Overall_Experience |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Let us now use AdaBoost with the tuned Random Forest as the base estimator.
# Initializing the AdaBoostClassifier with the tuned Random Forest as its base estimator
ada_rf = AdaBoostClassifier(base_estimator=rf_estimator_tuned, n_estimators=75, learning_rate=1.0, random_state=64)
# Fit the model to the training data
ada_rf.fit(X_train, Y_train)
# Predict on the train set
y_pred_ada_rf_train = ada_rf.predict(X_train)
# Evaluate the model
metrics_score(Y_train, y_pred_ada_rf_train)
precision recall f1-score support
0 1.00 1.00 1.00 42786
1 1.00 1.00 1.00 51593
accuracy 1.00 94379
macro avg 1.00 1.00 1.00 94379
weighted avg 1.00 1.00 1.00 94379
The ensemble reproduces the tuned Random Forest's perfect training score, so the overfitting persists. Let us still generate its predictions on the test set.
# Predict on the test set
y_pred_ada_rf_test = ada_rf.predict(X_test)
y_pred_ada_rf_test
array([1, 1, 1, ..., 1, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_8 = pd.DataFrame(y_pred_ada_rf_test)
Submission_8 = pd.concat([df_test['ID'], Prediction_8], axis=1)
Submission_8.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
Submission_8.to_csv('Hackathon_Submission_8.csv', index = False)
# Checking the dataframe
Submission_8.head()
|   | ID | Overall_Experience |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
6) XGBoost¶
import xgboost as xgb
# Initialize the XGBoost classifier
# (note: use_label_encoder is deprecated and removed in recent XGBoost releases; drop it there)
xgb_clf = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
# Fit the model to the training data
xgb_clf.fit(X_train, Y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, predictor=None, random_state=42, ...)
# Predict on the train set
y_pred_xg_train = xgb_clf.predict(X_train)
# Evaluate the model
metrics_score(Y_train, y_pred_xg_train)
precision recall f1-score support
0 0.96 0.97 0.97 42786
1 0.98 0.97 0.97 51593
accuracy 0.97 94379
macro avg 0.97 0.97 0.97 94379
weighted avg 0.97 0.97 0.97 94379
# Predict on the test set
y_pred_xg_test = xgb_clf.predict(X_test)
y_pred_xg_test
array([1, 1, 1, ..., 1, 1, 0])
# Exporting the prediction on the test dataset as .csv
Prediction_9 = pd.DataFrame(y_pred_xg_test)
Submission_9 = pd.concat([df_test['ID'], Prediction_9], axis=1)
Submission_9.columns=['ID', 'Overall_Experience']
# Saving the DataFrame as a CSV file
Submission_9.to_csv('Hackathon_Submission_9.csv', index = False)
# Checking the dataframe
Submission_9.head()
|   | ID | Overall_Experience |
|---|---|---|
| 0 | 99900001 | 1 |
| 1 | 99900002 | 1 |
| 2 | 99900003 | 1 |
| 3 | 99900004 | 0 |
| 4 | 99900005 | 1 |
Conclusions & Recommendations:¶
The Random Forest with tuned hyperparameters, AdaBoost with a Decision Tree base estimator, and XGBoost achieved the highest accuracy in predicting the overall experience of passengers on the Shinkansen Bullet Train. Since the Random Forest shows signs of overfitting compared to AdaBoost and XGBoost, the latter two would be more reliable in a production environment and should therefore be preferred. The tuned Random Forest identifies the following features as most important in determining overall customer experience: Onboard_Entertainment_Excellent, Onboard_Entertainment_Good, Seat_Comfort_Excellent, Seat_Comfort_Extremely_Poor and Seat_Comfort_Good. Efforts to improve the quality of the customer experience along these dimensions would therefore have a positive impact on customer satisfaction.